Facebook Unveils New Hardware to Manage Data Center Traffic

March 21, 2018
Facebook has designed a distributed networking system to support the massive flows of data moving between its data centers. The debut of Fabric Aggregator lays the groundwork for even larger cloud campuses to come.

SAN JOSE, Calif. – The massive data traffic coursing through Facebook’s cloud campuses has outstripped the capabilities of commercial networking hardware. So the company has built its own distributed networking system to support its growing data needs – and to lay the groundwork for even larger cloud campuses to come.

At Tuesday’s Open Compute Summit 2018, Facebook unveiled the Fabric Aggregator, a new system to manage data traffic between its data centers. The company is donating the device design to the Open Compute Project (OCP), the open source hardware project which Facebook co-founded in 2011.

The development of Fabric Aggregator highlights how the super-sizing of data centers and cloud campuses has implications for network equipment. As hyperscale companies like Facebook continue to grow, the huge volumes of user data prompt them to add data centers. This leads to larger cloud campuses, with truly massive volumes of data moving between them.

The volume of this “East-West” traffic between data centers far surpasses the volume of data traveling to other campuses and the Internet – known as “North-South” traffic.

Even Bigger Campuses Ahead?

As we noted yesterday, Facebook has been increasing the scale of its data center campuses. Since the beginning of 2017, Facebook has announced five new cloud campuses, and now has 12 around the globe, including nine in the U.S.

Early campuses featured two data centers, but the company has recently been building bigger.

“We’ve been moving to six buildings, creating a large increase in East-West traffic,” said Sree Sankar, Technical Product Manager at Facebook. “That required a big change.”

Facebook’s Sree Sankar describes how Fabric Aggregator manages traffic between data centers during a presentation at the Open Compute Summit Tuesday in San Jose, Calif. (Photo: Rich Miller)

The Fabric Aggregator is a distributed network system made up of a simple building block – Facebook’s Wedge 100 switch.

“Unfortunately using a large, general purpose network chassis no longer met our needs in terms of scale, power efficiency, and flexibility,” the Facebook Engineering Team said in a blog post. “Taking a disaggregated approach allows us to accommodate larger regions and varied traffic patterns, while providing the flexibility to adapt to future growth.”

What does “adapt to future growth” mean? One obvious possibility is even larger data center campuses, with more buildings. But the most important benefit of the new architecture is flexibility: Fabric Aggregator can be scaled in a logical fashion to serve campuses with different numbers of facilities.

Flexible Enough for Many Scenarios

The aggregator was designed so it can work within a single rack or in multi-rack configurations. Different flavors of the rack were designed to support various networking technologies that tie together the many layers of data center traffic.

“The ability to tailor different Fabric Aggregator node sizes in different regions allows us to use resources more efficiently, while having no internal dependencies keeps failures isolated, improving the overall reliability of the system,” Facebook said.

The Fabric Aggregator uses Wedge100S switches with Facebook Open Switching System (FBOSS) as the base building blocks, running Border Gateway Protocol (BGP) between all subswitches.

“A building block approach gives us the ability to operate the solution at either the subswitch or node level,” the Facebook team writes. “For example, if we detect a misbehaving subswitch inside a particular node, we can take that specific subswitch out of service for debugging. If there is a need to take all downstream and upstream subswitches out of service in a node, our operational tools abstract all the underlying complexities inherent to multiple interactions across many individual subswitches. We also implement redundancy at the node level so that we can take many nodes out of service simultaneously in a single region. The Fabric Aggregator layer can suffer many simultaneous failures without compromising the overall performance of the network.”
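To make the two operating levels concrete, here is a minimal, hypothetical sketch of how a tool might model them: draining a single subswitch for debugging, draining a whole node with one call that fans out to every subswitch, and checking that enough node-level redundancy remains before several nodes go offline at once. The class names, the `drain_*` methods, the `can_drain` check and the per-subswitch capacity figure are all illustrative assumptions, not Facebook's actual FBOSS tooling or any published API.

```python
from dataclasses import dataclass, field


@dataclass
class Subswitch:
    # Hypothetical model of one Wedge-based subswitch inside a node.
    name: str
    in_service: bool = True


@dataclass
class Node:
    # Hypothetical model of a Fabric Aggregator node built from subswitches.
    name: str
    subswitches: list = field(default_factory=list)

    def drain_subswitch(self, name: str) -> None:
        # Subswitch level: take one misbehaving subswitch out for debugging.
        for sw in self.subswitches:
            if sw.name == name:
                sw.in_service = False
                return
        raise KeyError(name)

    def drain_node(self) -> None:
        # Node level: one operator action hides the per-subswitch work.
        for sw in self.subswitches:
            sw.in_service = False

    def active(self) -> int:
        # Count the subswitches still carrying traffic.
        return sum(1 for sw in self.subswitches if sw.in_service)


def region_capacity(nodes, per_subswitch_gbps: int) -> int:
    # Total capacity of the in-service subswitches across a region.
    return sum(n.active() for n in nodes) * per_subswitch_gbps


def can_drain(nodes, node_names, per_subswitch_gbps: int, required_gbps: int) -> bool:
    # Node-level redundancy check: would the remaining nodes still meet demand?
    remaining = sum(n.active() for n in nodes if n.name not in node_names)
    return remaining * per_subswitch_gbps >= required_gbps


def build_region(num_nodes: int, subswitches_per_node: int):
    # Build a toy region; the sizes are arbitrary illustration values.
    return [
        Node(f"node-{i}", [Subswitch(f"node-{i}-sw-{j}") for j in range(subswitches_per_node)])
        for i in range(num_nodes)
    ]
```

For example, with a toy region of four nodes of four subswitches each and an assumed 100 Gbps per subswitch, draining one subswitch leaves 1,500 Gbps, and a request to drain two nodes at once is allowed only if the other two can still carry the required load. Keeping the redundancy check at the node level mirrors the article's point that failures stay isolated because nodes have no internal dependencies on each other.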

For additional technical details, see the Facebook Engineering blog post on Fabric Aggregator.

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
