Facebook: Open Sharing Was Key to Addressing Arc Flash Incidents

Oct. 19, 2017
At the 7×24 Exchange, Facebook disclosed two arc flash incidents at its data center in Sweden, and how its post-event analysis headed off other potential incidents.

PHOENIX, Ariz. – Data center operators don’t love talking publicly about failure. Ranking high on the list of those unwelcome failures is arc flash.

That’s why it was notable when executives from Facebook and Schneider Electric took the stage at this week’s 7×24 Exchange Fall Conference to discuss two arc flash incidents at Facebook’s data center in Sweden, and how the company’s post-event analysis and response headed off other potential incidents.

An arc flash is an electrical explosion that generates intense heat that can reach 35,000 degrees F, which can damage and even melt electrical equipment. Arc flash incidents also represent a significant threat to worker safety.

“How we talk about failure is important,” said James Swensen, Senior Global Facilities Operations Manager at Facebook. “Both of our organizations were concerned about sharing this story. No one wanted us to show a picture.

“It’s uncomfortable,” he continued. “It’s not fun. But if you don’t go through that pain, you’re not going to get to the lessons learned.”

Arc Flash on Overhead Busway

The arc flash incidents occurred in 2014 and 2015 during live operation of Facebook’s first international data center in Lulea, Sweden. No one was injured in either incident, and since the events were limited to one of the facility’s four power rooms, the data center never went offline.

Even so, an arc flash is a dramatic event in a data center, where the power infrastructure handles enormous amounts of electricity. Each data hall within a Facebook facility houses tens of thousands of servers. As a result, the potential cost of mistakes is high.

The incidents in Lulea occurred in an overhead busway, an enclosure housing copper bars that conduct electricity and distribute power within the data center. The 5,000 amp busway was assembled in sections that are each 10 feet long and weigh 700 pounds. Each section must be raised into place and joined to the rest of the busway.

The complexity of the assembly was heightened by the large number of parties involved in the install, according to Schneider Electric, which didn’t participate in the initial assembly but was brought in for the post-event analysis. “This isn’t a typical data center busway design,” said Bill Westbrock, Global Account Manager with Schneider Electric. “It’s a lot of power in a small space.”

At Issue: Torquing of Busway Joints

The analysis focused on the busway joints, where the sections came together. Schneider’s analysis found that the bolts that secure the joint were not torqued (tightened) properly during assembly. Over time, this created a “thermal avalanche” effect that eventually led the joint to fail. Here’s a diagram from the presentation:

A diagram of how inadequate tightening of a bolt in a busway joint led to an arc flash. (Image: Facebook, Schneider Electric)
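
To get a rough sense of why a poorly tightened joint matters at this scale, consider the basic relationship between current, resistance and heat (P = I²R). The sketch below is only an illustration: the contact resistance values are hypothetical, not figures from the Schneider analysis, and the actual load on the Lulea busway was not disclosed, so the 5,000 amp rating is used as a stand-in.

```python
def joint_heating_watts(current_amps: float, joint_resistance_ohms: float) -> float:
    """Heat dissipated in a joint, from the Joule heating relation P = I^2 * R."""
    return current_amps ** 2 * joint_resistance_ohms


# Busway rating cited in the article; the real operating current is an assumption.
RATED_CURRENT_AMPS = 5_000

# Hypothetical contact resistances, chosen only for illustration.
EXAMPLE_JOINTS = [("properly torqued joint", 5), ("loose joint", 50)]

for label, micro_ohms in EXAMPLE_JOINTS:
    watts = joint_heating_watts(RATED_CURRENT_AMPS, micro_ohms * 1e-6)
    print(f"{label}: {micro_ohms} micro-ohms -> about {watts:.0f} W of heat")
```

Under these assumed numbers, a tenfold rise in contact resistance means roughly ten times the heat at the joint, about 1,250 watts instead of 125. Heat tends to degrade the contact further, which is the kind of feedback loop the “thermal avalanche” description points to.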

The analysis enabled two key actions. The first was a review of busway joints across the Lulea facility and all other Facebook data centers using this design.

“In a matter of hours and days, people in other buildings with similar busways were checking for the issues that we had found,” said Swensen.

The review revealed three other locations in Facebook facilities where the joint bolts were not properly torqued. Swensen says he is convinced this averted additional arc flash incidents.

Schneider’s root cause analysis, a detailed look at how a joint bolt wound up not being properly torqued, found “many potential failure points.” Among the issues raised in the review was the inadequacy of periodic thermal scans designed to identify “hot spots” from emerging problems in busway joints.

In response, Schneider designed temperature sensors that could be deployed along the busway to monitor joint temperatures in real time and issue alerts.
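
The presentation did not detail how the alerting works. As a minimal sketch of the general idea, assuming each joint reports a temperature in degrees Celsius, a monitor could flag any joint that is either hot in absolute terms or running well above its neighbors. The function name, joint IDs and both thresholds below are hypothetical and are not taken from Schneider’s design.

```python
from statistics import median


def find_hot_joints(readings_c, max_temp_c=90.0, max_delta_c=15.0):
    """Return sorted IDs of joints that exceed an absolute limit or run
    well above the median of all joints. Thresholds are illustrative only."""
    if not readings_c:
        return []
    baseline = median(readings_c.values())
    return sorted(
        joint_id
        for joint_id, temp_c in readings_c.items()
        if temp_c > max_temp_c or temp_c - baseline > max_delta_c
    )


# Hypothetical readings: joint J3 is far warmer than its peers.
readings = {"J1": 41.0, "J2": 43.5, "J3": 71.0, "J4": 42.0}
print(find_hot_joints(readings))
```

Comparing each joint against its peers, rather than only against a fixed limit, is one way to catch a joint that is heating up long before it reaches a dangerous temperature. That is the gap a periodic thermal scan can leave open between inspections.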

Addressing Arc Flash in Data Centers

Arc flashes have led to expensive outages in data centers, most notably a 2009 incident at Fisher Plaza in Seattle, which knocked a major online payment gateway offline, slowing global e-commerce. The event was blamed on an insulation failure on a busway, and the building owner incurred $6.8 million in cash expenses related to the fire, including remediation and capital projects.

A series of arc flash incidents also occurred during the construction and testing of the NSA data center in Utah, according to media reports, which said the events caused significant equipment damage.

The larger danger with arc flash is worker safety. According to the American Society of Safety Engineers, more than 3,600 workers suffer disabling electrical contact injuries annually.

Reducing arc flash hazards has been a growing priority for data center power vendors and the National Fire Protection Association (NFPA), which in 2012 introduced new regulations designed to limit the scenarios in which technicians are working with energized equipment. It has also been a hot topic at industry conferences, which have highlighted strategies to reduce the risk of arc flash during testing of energized equipment.

Fixing the Problem, Not the Blame

Swensen said Facebook takes the issue of arc flash seriously, and wanted to share its findings with the industry.

“There are very important questions to ask,” he said. “If someone gets hurt because we didn’t take the right steps, what does that cost us as an organization? It falls right in with Facebook’s concept with Open Compute – if we share this, it helps the industry.”

The 7×24 Exchange is known for lively question and answer sessions. At the Facebook/Schneider presentation, a questioner noted that challenges in busway assembly and maintenance are well understood, and wondered whether Facebook had considered alternate power designs.

“We have made very fundamental changes (in busway design)” since the incidents, Swensen said, but he didn’t offer details. Alternate methods include the use of cables, instead of busway, in power distribution.

Schneider Electric’s Bill Westbrock speaks during the 7×24 Exchange Fall Conference in Phoenix. (Photo: Rich Miller)

Schneider Electric and Facebook said the key to effective after-incident reviews is addressing the “blame game” up front.

“I would argue that all too often the filter you have in mind is ‘who’s to blame,’” said Swensen. “This is where, as end users, it’s important to set the right tone and culture.”

“A lot of people are protecting their interests,” said Westbrock. “We said ‘let’s not focus on assessing the blame, but understand where the breakdown occurred.’ James said ‘there will be no blame assessed and no financial compensation sought.’”

That won’t always be easy, especially in events involving outages or injuries. But Westbrock said the importance of the issue calls for nothing less.

“There are going to be problems as we build these projects,” said Westbrock. “We’re looking at critical problems. In this industry, we all need to own up to this to ensure that we do better.”

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
