How Machine Learning is Changing the Face of the Data Center

Machine learning and artificial intelligence have arrived in the data center, changing the face of the hyperscale server farm as racks begin to fill with ASICs, GPUs, FPGAs and supercomputers.

These technologies provide more computing horsepower to train machine learning systems, a process that involves enormous amounts of data-crunching. The end goal is to create smarter applications and improve the services you already use every day.

“Artificial intelligence is now powering things like your Facebook Newsfeed,” said Jay Parikh, Global Head of Engineering and Infrastructure for Facebook. “It is helping us serve better ads. It is also helping make the site safer for people that use Facebook on a daily basis.”

“Machine Learning is transforming how developers build intelligent applications that benefit customers and consumers, and we’re excited to see the possibilities come to life,” said Norm Jouppi, Distinguished Hardware Engineer at Google.

Much of the computing power to create these services will be delivered from the cloud. As a result, cloud builders are adopting hardware acceleration techniques that have been common in high performance computing (HPC) and are making their way into the hyperscale computing ecosystem.

The race to leverage machine learning is led by the industry’s marquee names, including Google, Facebook and IBM. As usual, the battlefield runs through the data center, with implications for the major cloud platforms and chipmakers like Intel and NVIDIA.

Google Unveils TPU Hardware

Neural networks are computing systems that emulate the learning process of the human brain to solve new challenges, a process that requires lots of computing horsepower. That’s why the leading players in the field have moved beyond traditional CPU-driven servers and are now building systems that accelerate the work. In some cases, they’re creating their own chips.

Last week Google revealed the Tensor Processing Unit (TPU), a custom ASIC tailored for TensorFlow, an open source software library for machine learning that was developed by Google. An ASIC (application-specific integrated circuit) is a chip customized to perform a specific task. Recent examples of ASICs include the custom chips used in bitcoin mining. Google has used its TPUs to squeeze more operations per second into the silicon.

A custom rack in a Google data center packed with Tensor Processing Unit hardware for machine learning. (Photo: Google)

“We’ve been running TPUs inside our data centers for more than a year, and have found them to deliver an order of magnitude better-optimized performance per watt for machine learning,” writes Norm Jouppi, Distinguished Hardware Engineer, on the Google blog. “This is roughly equivalent to fast-forwarding technology about seven years into the future (three generations of Moore’s Law).”
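The arithmetic behind Jouppi’s comparison can be sketched in a few lines. This is only an illustration: it assumes each generation of Moore’s Law doubles performance per watt, and the function names are hypothetical rather than anything published by Google.

```python
import math

# Figures quoted in the article.
YEARS_FAST_FORWARDED = 7
GENERATIONS = 3


def gain_after_generations(generations: int) -> float:
    """Improvement factor after a number of doublings (assumed 2x per generation)."""
    return 2.0 ** generations


def generations_for_gain(gain: float) -> float:
    """Number of doublings needed to reach a given improvement factor."""
    return math.log2(gain)


# Three doublings give 8x, which is close to an order of magnitude (10x).
gain = gain_after_generations(GENERATIONS)

# A full 10x gain would take slightly more than three doublings.
doublings_for_10x = generations_for_gain(10.0)

# Seven years spread over three generations is a bit over two years each.
years_per_generation = YEARS_FAST_FORWARDED / GENERATIONS

print(f"gain after {GENERATIONS} generations: {gain:.0f}x")
print(f"doublings needed for 10x: {doublings_for_10x:.2f}")
print(f"years per generation: {years_per_generation:.2f}")
```

Under that doubling assumption, three generations work out to an 8x improvement, which is why “an order of magnitude” and “three generations” describe roughly the same leap.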

A board with a TPU fits into a hard disk drive slot in a data center rack. Google used its TPU infrastructure to power AlphaGo, the software program that defeated world Go champion Lee Sedol in a match. Go is a complex board game in which human players maintained an edge on computers, which long ago had overtaken the abilities of humans in games like chess or “Jeopardy.” The complexities of Go presented a challenge to artificial intelligence technology, but the extra power supplied by TPUs helped Google’s program solve more difficult computational challenges and defeat Sedol.

“Our goal is to lead the industry on machine learning and make that innovation available to our customers,” writes Jouppi. “Building TPUs into our infrastructure stack will allow us to bring the power of Google to developers across software like TensorFlow and Cloud Machine Learning with advanced acceleration capabilities.”

Big Sur GPUs power Facebook’s AI Infrastructure

Facebook’s AI Lab is using GPUs to bring more horsepower to bear on data-crunching for its artificial intelligence (AI) and machine learning platform.

“We’ve been investing a lot in our artificial intelligence technology,” said Parikh.

Facebook’s Big Sur is Open Rack-compatible hardware designed for AI computing at a large scale. (Image: Facebook)

The Big Sur system leverages NVIDIA’s Tesla Accelerated Computing Platform, with eight high-performance GPUs of up to 300 watts each and the flexibility to configure between multiple PCI-e connections. Facebook has optimized these new servers for thermal and power efficiency, allowing the company to operate them in its data centers alongside standard CPU-powered servers.
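To see why thermal and power efficiency matter for a server like this, here is a rough sketch of the power math. The GPU count and wattage come from the figures above; the rack power budget and the non-GPU overhead per server are hypothetical values chosen purely for illustration, not Facebook specifications.

```python
GPUS_PER_SERVER = 8       # from the article
MAX_WATTS_PER_GPU = 300   # from the article


def gpu_power_watts(gpus: int = GPUS_PER_SERVER,
                    watts_per_gpu: int = MAX_WATTS_PER_GPU) -> int:
    """Peak power drawn by the GPUs alone in one server."""
    return gpus * watts_per_gpu


def servers_per_rack(rack_budget_watts: int, non_gpu_watts_per_server: int) -> int:
    """How many servers fit within a rack power budget (whole servers only)."""
    per_server = gpu_power_watts() + non_gpu_watts_per_server
    return rack_budget_watts // per_server


# Eight GPUs at up to 300 watts each: 2,400 watts of GPU power per server.
print(gpu_power_watts())

# Hypothetical: a 12,000-watt rack and 600 watts per server for CPUs, fans, etc.
print(servers_per_rack(rack_budget_watts=12_000, non_gpu_watts_per_server=600))
```

With those assumed numbers, only four such servers would fit in the rack’s power budget, which illustrates how quickly GPU-heavy systems consume the power available to a rack.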

The gains in performance and latency provided by Big Sur help Facebook process more data, dramatically shortening the time needed to train its neural networks.

“It is a significant improvement in performance,” said Parikh. “We’ve deployed thousands of these machines in a matter of months. It gives us the ability to drive this technology into more product use cases within the company.”

Intel Focuses on FPGAs

On the hardware front, NVIDIA is perhaps the leading beneficiary of the new focus on machine learning, which has boosted sales of its GPU technology to hyperscale players. But it’s not the only chipmaker targeting the machine learning market.

Intel recently began sampling a new module that combines its traditional CPUs with field programmable gate arrays (FPGAs), semiconductors that can be reprogrammed to perform specialized computing tasks. FPGAs are similar to ASICs in that they allow users to tailor compute power to specific workloads or applications, but FPGAs can be reprogrammed to new tasks.

Intel sees FPGAs as the key to designing a new generation of products to address emerging customer workloads in the data center sector. In 2015 it paid $16 billion to acquire Altera, a leading player in FPGAs and other programmable logic devices (PLDs) to automate industrial infrastructure.

“We think FPGAs are very strategic,” said Raejeanne Skillern, GM of the Cloud Service Provider Business at Intel. “We’re doing a lot of development with OEMs and customers, and continuing to implement (FPGAs) into our roadmap.”

A particular focus for Intel is the “Super 7” group of cloud service providers that are driving hyperscale infrastructure innovation, which includes Amazon, Facebook, Google and Microsoft, along with Chinese hyperscale companies Alibaba, Baidu and Tencent. Intel projects that by 2020, more than 30 percent of cloud service provider nodes will be accelerated by FPGAs.

Beyond Jeopardy: IBM Takes Watson to Market

IBM is pursuing a different path with its push into artificial intelligence, which Big Blue refers to as “cognitive computing.” IBM is targeting enterprise users, and leading with Watson.

The IBM Watson supercomputer became the poster child for artificial intelligence in 2011, defeating two human champions in a game of Jeopardy.

As the race to bring AI and machine learning to the mass market accelerates, IBM is seeking to keep Watson relevant with commercial offerings that show how AI can be used in the enterprise and public sector.

Watson consists of a collection of algorithms and software running on IBM’s Power 750 line of servers, and learns from data instead of being explicitly programmed to carry out instructions. IBM says Watson is the ideal tool to help companies make sense of Big Data.

“We are seeing massive, massive growth in the amount of data, and most of that is unstructured,” said Steven Abrams, Distinguished Engineer at IBM’s Thomas Watson Research Center. “Until now, it’s been hard to get our arms around that data and what we can do with it.”

At the recent DataCenterDynamics Enterprise event in New York, Abrams outlined how customers can use Watson to build applications. IBM is used to “large, transformative engagements” with enterprise clients, but is now offering a subscription model in which clients can use Watson via cloud APIs (application programming interfaces).

At DataCenterDynamics Enterprise, a group of startups and analytics firms described how they were using IBM Watson’s “cognitive computing” capabilities to create applications. (Photo: Rich Miller)

“We have to make Watson available to the type of company that may not normally be able to do business with IBM, and give them access to Watson technology,” said Abrams. “We’ve gotten to the point where the technology is much closer to the self-service model. We’re really focusing on developers. We’re focused on helping people go from 0 to 60 in much less time.”

On the DCD panel, several customers discussed how they are using Watson to build apps. Some examples:

PurpleForge trains Watson to provide quick answers to engineering and support questions. Watson absorbs details from textbooks and manuals and then builds that knowledge into a database that can be queried with natural language questions. “It lets you do things you never thought were possible,” said Brian Hurley, President and CEO of PurpleForge.
Data security startup SparkCognition uses Watson to collate vulnerability databases and lists, and uses the data to provide real-time remediation advice to security professionals. Director of Business Development Stuart Gillen says using Watson shortens the time-to-solution for potential threats.
Equals 3 Media allows digital marketers to use big data analytics to refine ad serving, matching ads to users and their interests. Equals 3 uses Watson’s “personality insights” service, which scans social media networks, mining them for signals that can help advertisers target their messaging. For example, the Watson personality profiles might be used to market a car in different ways to different users – touting performance and speed to an adventure-oriented single person, and safety features to a suburban mom.

The personality profiling is one aspect of machine learning that may test users’ comfort levels, one panelist noted, raising the specter of a “giant Watson in the sky that knows everything.”

Cloud Delivery Model

Whether it’s Watson or competing services, it’s clear that the cloud will be the primary delivery method for consumer-facing services that tap machine learning. Google, Microsoft and Amazon Web Services all now offer fully managed cloud services that provide the ability to analyze data and build applications or services.

As a result, the hardware required to support machine learning will live primarily in hyperscale data centers, which are already highly customized for extreme efficiency and high density workloads. These services are relatively new, so it’s not yet clear whether cloud economics will favor keeping these services in third-party clouds, or whether it will eventually make sense for end users to shift these workloads to company-operated data centers. For cloud services, it typically requires significant scale before the economics shift in favor of a company-operated facility.

But for the data center community, the benefits of machine learning aren’t measured only by the volume of hardware. Google is using machine learning and artificial intelligence to wring even more efficiency out of its mighty data centers. Joe Kava, Vice President for Data Center Operations at Google, said the use of neural networks will allow Google to reach new frontiers in efficiency in its server farms, moving beyond what its engineers can see and analyze.

“Our data centers are very large and complex,” said Kava. “The sheer number of interactions and operating parameters makes it really impossible for us mere mortals to understand how to optimize a data center in real time. However, it really is pretty trivial for computers to crunch through all those scenarios and find the optimal settings.

“Over the past couple of years, we’ve developed these algorithms and we’ve trained them with billions of data points from all of our data centers all over the world,” said Kava. “Now we use this machine learning to help our teams visualize the data, so the operations teams can know how to set up the electrical and mechanical plants for the optimal settings on any given day.”

In early usage, the neural network has been able to predict Google’s Power Usage Effectiveness with 99.6 percent accuracy. Its recommendations have led to efficiency gains that appear small, but can lead to major cost savings when applied across a data center housing tens of thousands of servers.
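For readers unfamiliar with the metric, Power Usage Effectiveness (PUE) is the ratio of a facility’s total power draw to the power consumed by its IT equipment, so a value closer to 1.0 means less overhead. The sketch below shows what a 99.6 percent accurate prediction could look like; the power readings and the predicted value are hypothetical numbers for illustration, not Google data, and treating accuracy as one minus the relative error is an assumption about how the figure is defined.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def prediction_accuracy(predicted: float, actual: float) -> float:
    """Accuracy expressed as 1 minus the relative error (assumed definition)."""
    return 1.0 - abs(predicted - actual) / actual


# Hypothetical readings: 11,000 kW for the whole facility, 10,000 kW for IT gear.
actual_pue = pue(total_facility_kw=11_000, it_equipment_kw=10_000)

# Hypothetical model output that is off by 0.4 percent.
predicted_pue = 1.1044

accuracy = prediction_accuracy(predicted_pue, actual_pue)
print(f"actual PUE: {actual_pue:.3f}")
print(f"predicted PUE: {predicted_pue:.4f}")
print(f"accuracy: {accuracy:.1%}")
```

In this example, a prediction within 0.4 percent of a PUE of 1.1 is off by only about 0.004, which gives a sense of how fine-grained the model’s forecasts would need to be.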