Facebook Unveils New Hardware to Manage Data Center Traffic

Facebook has designed a distributed networking system to support the massive flows of data moving between its data centers. The debut of Fabric Aggregator lays the groundwork for even larger cloud campuses to come.

Rich Miller

March 21, 2018


A look at four flavors of the Facebook Fabric Aggregator, handling traffic flows at multiple levels of its data center architecture. (Image: Facebook)

SAN JOSE, Calif. – The massive data traffic coursing through Facebook’s cloud campuses has outstripped the capabilities of commercial networking hardware. So the company has built its own distributed networking system to support its growing data needs – and to lay the groundwork for even larger cloud campuses to come.

At Tuesday’s Open Compute Summit 2018, Facebook unveiled the Fabric Aggregator, a new system to manage data traffic between its data centers. The company is donating the device design to the Open Compute Project (OCP), the open source hardware project which Facebook co-founded in 2011.

The development of Fabric Aggregator highlights how the super-sizing of data centers and cloud campuses has implications for network equipment. As hyperscale companies like Facebook continue to grow, the huge volumes of user data prompt them to add data centers. This leads to larger cloud campuses, with truly massive volumes of data moving between them.

The volume of this “East-West” traffic between data centers far surpasses the volume of data traveling to other campuses and the Internet – known as “North-South” traffic.

Even Bigger Campuses Ahead?

As we noted yesterday, Facebook has been increasing the scale of its data center campuses. Since the beginning of 2017, Facebook has announced five new cloud campuses, and now has 12 around the globe, including nine in the U.S.

Early campuses featured two data centers, but the company has recently been building bigger.

“We’ve been moving to six buildings, creating a large increase in East-West traffic,” said Sree Sankar, Technical Product Manager at Facebook. “That required a big change.”

Facebook’s Sree Sankar describes how Fabric Aggregator manages traffic between data centers during a presentation at the Open Compute Summit Tuesday in San Jose, Calif. (Photo: Rich Miller)

The Fabric Aggregator is a distributed network system made up of a simple building block – Facebook’s Wedge 100 switch.

“Unfortunately using a large, general purpose network chassis no longer met our needs in terms of scale, power efficiency, and flexibility,” the Facebook Engineering Team said in a blog post. “Taking a disaggregated approach allows us to accommodate larger regions and varied traffic patterns, while providing the flexibility to adapt to future growth.”

What does “adapt to future growth” mean? One obvious possibility is even larger data center campuses, with more buildings. But the most important benefit of the new architecture is flexibility: Fabric Aggregator allows Facebook to manage campuses with different numbers of facilities in a logical fashion.

Flexible Enough for Many Scenarios

The aggregator was designed so it can work within a single rack or in multi-rack configurations. Different flavors of the rack were designed to support various networking technologies that tie together the many layers of data center traffic.

“The ability to tailor different Fabric Aggregator node sizes in different regions allows us to use resources more efficiently, while having no internal dependencies keeps failures isolated, improving the overall reliability of the system,” Facebook said.

The Fabric Aggregator uses Wedge 100S switches with Facebook Open Switching System (FBOSS) as the base building blocks, running Border Gateway Protocol (BGP) between all subswitches.

“A building block approach gives us the ability to operate the solution at either the subswitch or node level,” the Facebook team writes. “For example, if we detect a misbehaving subswitch inside a particular node, we can take that specific subswitch out of service for debugging. If there is a need to take all downstream and upstream subswitches out of service in a node, our operational tools abstract all the underlying complexities inherent to multiple interactions across many individual subswitches. We also implement redundancy at the node level so that we can take many nodes out of service simultaneously in a single region. The Fabric Aggregator layer can suffer many simultaneous failures without compromising the overall performance of the network.”
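To make the subswitch-versus-node distinction concrete, here is a minimal Python sketch of the idea. It is purely illustrative and is not Facebook's tooling or any part of FBOSS: the `Subswitch` and `Node` classes, the `build_node` helper, the "downstream"/"upstream" role labels, the switch counts, and the capacity measure (the share of subswitches still in service) are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Subswitch:
    """One building-block switch inside a node (hypothetical model)."""
    name: str
    role: str  # "downstream" or "upstream"; labels assumed for illustration
    in_service: bool = True


@dataclass
class Node:
    """A Fabric Aggregator node modeled as a group of subswitches."""
    name: str
    subswitches: list = field(default_factory=list)

    def drain_subswitch(self, subswitch_name: str) -> None:
        """Take a single misbehaving subswitch out of service."""
        for sw in self.subswitches:
            if sw.name == subswitch_name:
                sw.in_service = False
                return
        raise KeyError(subswitch_name)

    def drain(self) -> None:
        """Take every subswitch in the node out of service in one step."""
        for sw in self.subswitches:
            sw.in_service = False

    def active_count(self) -> int:
        return sum(1 for sw in self.subswitches if sw.in_service)


def build_node(name: str, downstream: int = 2, upstream: int = 2) -> Node:
    """Assemble a node from identical subswitches (counts are made up)."""
    switches = [Subswitch(f"{name}-ds{i}", "downstream") for i in range(downstream)]
    switches += [Subswitch(f"{name}-us{i}", "upstream") for i in range(upstream)]
    return Node(name, switches)


def region_capacity(nodes) -> float:
    """Fraction of subswitches in the region that are still in service."""
    total = sum(len(n.subswitches) for n in nodes)
    active = sum(n.active_count() for n in nodes)
    return active / total if total else 0.0


if __name__ == "__main__":
    region = [build_node(f"node{i}") for i in range(4)]
    region[0].drain_subswitch("node0-ds0")  # subswitch-level operation
    region[1].drain()                       # node-level operation
    print(f"capacity remaining: {region_capacity(region):.2%}")
```

The point of the sketch is only that the same small unit supports both kinds of operation: an operator can pull one subswitch for debugging or drain a whole node, and the rest of the region keeps carrying traffic. A real deployment would also have to withdraw BGP sessions and rebalance traffic, which this toy model leaves out.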

For additional technical details, see the Facebook Engineering Post on Fabric Aggregator.

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
