Metaverse Platform Roblox Adds Data Center to Address 73-Hour Outage

Jan. 24, 2022
Gaming and metaverse platform Roblox is expanding its infrastructure to address a 73-hour outage in October that left its 50 million daily users offline. The outage highlights the growing focus on infrastructure complexity and uptime.

Online gaming and metaverse platform Roblox is expanding its infrastructure in the wake of a 73-hour outage in October that left its 50 million daily users offline.

The company will add a data center and expand its availability zones, hoping to address the October downtime. The company’s CEO said a key factor in the outage was “the growth in the number of servers in our data centers.” The downtime cost Roblox an estimated $25 million in lost bookings, the company said.

The outage was reviewed in an incident report released last week, which outlined how several software services contended for resources, making it harder to diagnose a bug in a database. The incident illustrates how the growing complexity of online applications can make automated infrastructure harder to troubleshoot, leading to lengthier outages.

The Roblox downtime was one of several extended cloud-level outages in 2021, which are driving a renewed focus on reliability engineering for complex infrastructures. DCF highlighted this issue in our 2022 Forecast, noting that “uptime is becoming more complex, requiring backup and failover strategies that span cloud, colo, on-premise facilities and edge infrastructure.”

The expansion by Roblox also underscores how metaverse-style applications will rely on significant amounts of infrastructure – a reality that could generate additional demand for digital infrastructure like data centers and network connectivity, as was noted in our recent Data Center Executive Roundtable (The Metaverse Will Need A Lot of Data Centers).

The Challenges of Growth

Roblox is an online platform for games and virtual experiences, which is available across multiple OSes and devices. Along with Minecraft and Fortnite, it has been cited as an early example of a metaverse – a collection of virtual worlds, landscapes and characters available through an immersive online environment.

Roblox is free to play and download, but operates a large in-world economy based on the Robux currency. More than 9.5 million developers have deployed games and apps, and can make money by selling items (such as clothing or avatars) in online storefronts. Roblox went public through a direct listing last year, and had $509 million in revenue in the third quarter of 2021.

That’s why the lengthy outage in October became a significant business event for Roblox, which reported that the downtime led to $25 million in lost bookings and prompted the company to issue $6.8 million in credits to developers as compensation for lost sales.

“This was not due to any peak in external traffic or any particular experience,” founder and CEO David Baszucki wrote on Oct. 31. “Rather the failure was caused by the growth in the number of servers in our data centers. The result was that most services at Roblox were unable to effectively communicate and deploy.”

Roblox operates several data centers with more than 18,000 servers, which support more than 170,000 software containers. The company uses a software suite from HashiCorp to manage its infrastructure. The software issues that contributed to the Roblox outage are complex, but one clear response was the need for more diverse infrastructure.

“Running all Roblox backend services on one Consul (service mesh) cluster left us exposed to an outage of this nature,” Roblox engineer Daniel Sturman wrote in a detailed blog post. “We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services.

“We have efforts underway to move to multiple availability zones within these data centers,” he added. “We have made major modifications to our engineering roadmap and our staffing plans in order to accelerate these efforts.”

Within the last month, Roblox has advertised for a new data center manager position based in Ashburn, the leading cloud and connectivity hub in Northern Virginia.

Uncovering A ‘Pathological Performance Issue’

As is the case in many extended outages, the Roblox issues were difficult to diagnose due to confusion about the root cause of the problems. This passage in the incident report provides a high-level description.

“The root cause was due to two issues. Enabling a relatively new streaming feature on Consul under unusually high read and write load led to excessive contention and poor performance. In addition, our particular load conditions triggered a pathological performance issue in BoltDB. The open source BoltDB system is used within Consul to manage write-ahead-logs for leader election and data replication.

  • A single Consul cluster supporting multiple workloads exacerbated the impact of these issues.
  • Challenges in diagnosing these two primarily unrelated issues buried deep in the Consul implementation were largely responsible for the extended downtime.
  • Critical monitoring systems that would have provided better visibility into the cause of the outage relied on affected systems, such as Consul. This combination severely hampered the triage process.”

For additional technical details, see the Roblox incident report.

As the service continues to improve and expand its infrastructure, it continues to experience periodic shorter outages, most recently on Saturday.

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
