With Big Basin, Facebook Beefs Up its AI Hardware

March 9, 2017
Facebook is beefing up its high performance computing horsepower with Big Basin, an AI server powered by eight NVIDIA GPU accelerators. Big Basin was introduced at today’s Open Compute Summit.

SANTA CLARA, Calif. – Facebook is beefing up its high performance computing horsepower, enhancing its use of artificial intelligence to personalize your news feed.

Facebook introduced brawny new hardware to power its AI workloads today at the 2017 Open Compute Summit at the Santa Clara Convention Center. Known as Big Basin, the unit brings more memory to its GPU-powered data crunching. It’s a beefier successor to Big Sur, the first-generation Facebook AI server unveiled last July.

“With Big Basin, we can train machine learning models that are 30 percent larger because of the availability of greater arithmetic throughput and a memory increase from 12 GB to 16 GB,” said Kevin Lee, a Technical Program Manager at Facebook. “This enables our researchers and engineers to move more quickly in developing increasingly complex AI models that aim to help Facebook further understand text, photos, and videos on our platforms and make better predictions based on this content.”
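The numbers in the quote can be sanity-checked with quick back-of-envelope arithmetic. A minimal sketch, assuming model-size headroom scales roughly with per-GPU device memory (the 12 GB figure corresponds to Big Sur's earlier-generation GPUs, the 16 GB figure to Big Basin's Tesla P100s):

```python
# Back-of-envelope check on the memory figures cited in the article.
# Assumption: trainable model size scales roughly with available GPU memory.
old_mem_gb = 12  # per-GPU memory in Big Sur
new_mem_gb = 16  # per-GPU memory in Big Basin (Tesla P100)

growth = (new_mem_gb - old_mem_gb) / old_mem_gb
print(f"Memory increase: {growth:.0%}")
```

The raw memory increase is about 33 percent, consistent with the "30 percent larger" models Facebook cites once other per-GPU overheads are accounted for.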

Making Your Newsfeed Smarter

Big Sur and Big Basin play important roles in Facebook’s bid to create a smarter newsfeed for its 1.9 billion users around the globe. With this hardware, Facebook can train its machine learning systems to recognize speech, understand the content of video and images, and translate content from one language to another.

As leading tech companies push the boundaries of machine learning, they often follow a do-it-yourself approach to their HPC hardware. Google, Apple and Amazon have also created research labs to pursue faster and better AI capabilities. They have taken different approaches to hardware, with Google opting for custom ASICs (application-specific integrated circuits) for its machine learning operations.

Facebook has chosen to use NVIDIA graphics processing units (GPUs) for its machine learning hardware. Facebook has been designing its own hardware for many years. In preparing to upgrade Big Sur, the Facebook engineering team gathered feedback from colleagues in Applied Machine Learning (AML), Facebook AI Research (FAIR), and infrastructure teams.

The Power of Disaggregation

For Big Basin, Facebook collaborated with QCT (Quanta Cloud Technology), one of the original design manufacturers (ODMs) that works closely with the Open Compute community. Big Basin features eight NVIDIA Tesla P100 GPU accelerators, connected using NVIDIA NVLink to form an eight-GPU hybrid cube mesh — similar to the architecture used by NVIDIA’s DGX-1 “supercomputer in a box.”
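The hybrid cube-mesh topology can be sketched in a few lines of code. This is an illustrative reconstruction, not Facebook's or NVIDIA's wiring diagram: it assumes each GPU is fully connected to the other three GPUs in its quad (the "mesh") and has one link to its counterpart in the opposite quad (the "cube"), giving four NVLink connections per GPU.

```python
# Illustrative sketch of an 8-GPU hybrid cube-mesh NVLink topology.
# Assumption: two fully connected quads of 4 GPUs, plus one cross-quad
# link per GPU -- four links per GPU, as on first-generation NVLink.

def hybrid_cube_mesh(n_gpus=8):
    quad = n_gpus // 2
    links = set()
    for g in range(n_gpus):
        base = (g // quad) * quad
        # "Mesh": connect to every other GPU within the same quad.
        for peer in range(base, base + quad):
            if peer != g:
                links.add(tuple(sorted((g, peer))))
        # "Cube": one link to the corresponding GPU in the other quad.
        links.add(tuple(sorted((g, (g + quad) % n_gpus))))
    return sorted(links)

topology = hybrid_cube_mesh()
print(len(topology))  # 16 unique GPU-to-GPU links
```

Under these assumptions the fabric has 16 unique links, and every GPU can reach any other in at most two hops, which is what makes the layout attractive for collective operations in distributed training.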

Big Basin features eight NVIDIA Tesla P100 GPU accelerators. It’s the successor to Big Sur, which used an earlier version of NVIDIA’s GPU technology. (Photo: Facebook)

Big Basin offers an example of one of the key principles guiding Facebook’s hardware design – disaggregation. Key components are built using a modular design, separating the CPU compute from the GPUs, making it easier to integrate components as new technology emerges.

“For the Big Basin deployment, we are connecting our Facebook-designed, third-generation compute server as a separate building block from the Big Basin unit, enabling us to scale each component independently,” Lee writes in a blog post announcing Big Basin. “The GPU tray in the Big Basin system can be swapped out for future upgrades and changes to the accelerators and interconnects.”

Flexibility and Faster Upgrades

Big Basin is split into three main sections: the accelerator tray, the inner chassis, and the outer chassis. The disaggregated design allows the GPUs to be positioned directly in front of the cool air being drawn into the system, removing preheat from other components and improving the overall thermal efficiency of Big Basin.

There are multiple advantages to this disaggregated design, according to Eran Tal, an engineering manager at Facebook.

“The concept is breaking down and separating components, and creating the ability to select what solution you want at different levels of hardware,” said Tal. “It gives you a lot of flexibility in addressing design with a fast-changing workload. You can never know what you will need tomorrow.

“You’re maximizing efficiency and flexibility,” he added.

Two New Server Models

Facebook also introduced two new server designs, each representing the next generation of existing OCP designs.

  • Tioga Pass is the successor to Leopard, which is used for a variety of compute services at Facebook. Tioga Pass has a dual-socket motherboard that uses the same 6.5” by 20” form factor and supports both single-sided and double-sided designs. The double-sided design, with DIMMs on both sides of the PCB, allows Facebook to maximize memory capacity. The flexible design also allows Tioga Pass to serve as the head node for both the Big Basin JBOG (Just a Bunch of GPUs) and Lightning JBOF (Just a Bunch of Flash), doubling the available PCIe bandwidth when accessing either GPUs or flash.
  • Yosemite v2 is a refresh of Facebook’s Yosemite multi-node compute platform. The new server includes four server cards. Unlike Yosemite, the new power design supports hot service — servers can continue to operate and don’t need to be powered down when the sled is pulled out of the chassis for components to be serviced. With the previous design, repairing a single server prevented access to the other three servers, since all four lost power.
About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
