PHOENIX, Ariz. – Data center operators don’t love talking publicly about failure. Ranking high on the list of those unwelcome failures is arc flash.
That’s why it was notable when executives from Facebook and Schneider Electric took the stage at this week’s 7×24 Exchange Fall Conference to discuss two arc flash incidents at Facebook’s data center in Sweden, and how the company’s post-event analysis and response headed off other potential incidents.
An arc flash is an electrical explosion that generates intense heat that can reach 35,000 degrees F, which can damage and even melt electrical equipment. Arc flash incidents also represent a significant threat to worker safety.
“How we talk about failure is important,” said James Swensen, Senior Global Facilities Operations Manager at Facebook. “Both of our organizations were concerned about sharing this story. No one wanted us to show a picture.
“It’s uncomfortable,” he continued. “It’s not fun. But if you don’t go through that pain, you’re not going to get to the lessons learned.”
Arc Flash on Overhead Busway
The arc flash incidents occurred in 2014 and 2015 during live operation of Facebook’s first international data center in Lulea, Sweden. No one was injured in either incident, and since the events were limited to one of the facility’s four power rooms, the data center never went offline.
Even so, an arc flash is a dramatic event in a data center, where the power infrastructure handles enormous amounts of electricity. Each data hall within a Facebook facility houses tens of thousands of servers. As a result, the potential cost of mistakes is high.
The incidents in Lulea occurred in an overhead busway, an enclosure housing copper bars that conduct electricity and distribute power within the data center. The 5,000-amp busway was assembled in sections that are each 10 feet long and weigh 700 pounds. Each section must be raised into place and joined to the rest of the busway.
The complexity of the assembly was heightened by the large number of parties involved in the install, according to Schneider Electric, which didn’t participate in the initial assembly but was brought in for the post-event analysis. “This isn’t a typical data center busway design,” said Bill Westbrock, Global Account Manager with Schneider Electric. “It’s a lot of power in a small space.”
At Issue: Torquing of Busway Joints
The analysis focused on the busway joints, where the sections came together. Schneider’s analysis found that the bolts that secure the joint were not torqued (tightened) properly during assembly. Over time, this created a “thermal avalanche” effect that eventually caused the joint to fail. The presentation included a diagram of the effect.
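As a rough illustration of why a poorly torqued joint matters, the heat generated in a joint follows the standard resistive heating relationship (power equals current squared times resistance). The sketch below is not from the Schneider analysis. The 5,000 amp figure is the busway rating cited above, while the function name and the joint resistance values are hypothetical, chosen only to show how the heat scales.

```python
def joint_heat_watts(current_amps: float, joint_resistance_micro_ohms: float) -> float:
    """Heat dissipated in a joint by resistive (I^2 * R) heating, in watts."""
    resistance_ohms = joint_resistance_micro_ohms * 1e-6
    return current_amps ** 2 * resistance_ohms


# 5,000 A is the busway rating cited in the article; the resistance values
# below are hypothetical and serve only to show the scaling.
RATED_CURRENT_AMPS = 5000.0

for micro_ohms in (10.0, 50.0, 100.0):
    watts = joint_heat_watts(RATED_CURRENT_AMPS, micro_ohms)
    print(f"{micro_ohms:>5.0f} micro-ohms -> {watts:,.0f} W")
```

Under these assumed values, a tenfold rise in joint resistance produces ten times the heat in the same small joint. Because a hotter joint can conduct worse still, the heating can feed on itself, which is consistent with the “thermal avalanche” label.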
The analysis enabled two key actions. The first was a review of busway joints across the Lulea facility and all other Facebook data centers using this design.
“In a matter of hours and days, people in other buildings with similar busways were checking for the issues that we had found,” said Swensen.
The review revealed three other locations in Facebook facilities where the joint bolts were not properly torqued. Swensen says he is convinced this averted additional arc flash incidents.
Schneider’s root cause analysis found “many potential failure points” in a detailed review of how a joint bolt wound up not being properly torqued. Among the issues raised was the inadequacy of periodic thermal scans designed to identify “hot spots” from emerging problems in busway joints.
In response, Schneider designed temperature sensors that could be deployed along the busway to provide real-time monitoring of temperature at the busway joints and issue alerts. The presentation included an overview of the monitoring system.
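The details of Schneider’s sensor system were not described, but the general idea of continuous monitoring with alerts can be sketched in a few lines. Everything below is an assumption for illustration: the `JointReading` type, the `classify` and `alerts` functions, and both threshold values are hypothetical, and real limits would come from the busway manufacturer and the site’s engineers.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only.
WARN_RISE_C = 30.0   # rise over ambient (deg C) that triggers a warning
ALARM_RISE_C = 50.0  # rise over ambient (deg C) that triggers an alarm


@dataclass
class JointReading:
    joint_id: str
    joint_temp_c: float
    ambient_temp_c: float


def classify(reading: JointReading) -> str:
    """Return 'ok', 'warn', or 'alarm' based on temperature rise over ambient."""
    rise = reading.joint_temp_c - reading.ambient_temp_c
    if rise >= ALARM_RISE_C:
        return "alarm"
    if rise >= WARN_RISE_C:
        return "warn"
    return "ok"


def alerts(readings):
    """Return (joint_id, level) for every reading that is not 'ok'."""
    results = []
    for reading in readings:
        level = classify(reading)
        if level != "ok":
            results.append((reading.joint_id, level))
    return results
```

Comparing each joint against ambient temperature, rather than against a fixed absolute value, is one way to keep a warm power room from masking a joint that is running hotter than its neighbors. Unlike a periodic thermal scan, a check like this can run on every reading.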
Addressing Arc Flash in Data Centers
Arc flashes have led to expensive outages in data centers, most notably a 2009 incident at Fisher Plaza in Seattle, which knocked a major online payment gateway offline, slowing global e-commerce. The event was blamed on an insulation failure on a busway, and the building owner incurred $6.8 million in cash expenses related to the fire, including remediation and capital projects.
There was a series of arc flash incidents during the construction and testing of the NSA data center in Utah, according to media reports, which said the events caused significant equipment damage.
The larger danger with arc flash is worker safety. According to the American Society of Safety Engineers, more than 3,600 workers suffer disabling electrical contact injuries annually.
Reducing arc flash hazards has been a growing priority for data center power vendors and the National Fire Protection Association (NFPA), which in 2012 introduced new regulations designed to limit the scenarios in which technicians are working with energized equipment. It has also been a hot topic at industry conferences, which have highlighted strategies to reduce the risk of arc flash during testing of energized equipment.
Fixing the Problem, Not the Blame
Swensen said Facebook takes the issue of arc flash seriously, and wanted to share its findings with the industry.
“There are very important questions to ask,” he said. “If someone gets hurt because we didn’t take the right steps, what does that cost us as an organization? It falls right in with Facebook’s concept with Open Compute – if we share this, it helps the industry.”
The 7×24 Exchange is known for lively question and answer sessions. At the Facebook/Schneider presentation, a questioner noted that challenges in busway assembly and maintenance are well understood, and wondered whether Facebook had considered alternate power designs.
“We have made very fundamental changes (in busway design)” since the incidents, Swensen said, though he didn’t offer details. Alternate methods include the use of cables, instead of busway, in power distribution.
Schneider Electric and Facebook said the key to effective after-incident reviews is addressing the “blame game” up front.
“I would argue that all too often the filter you have in mind is ‘who’s to blame,’” said Swensen. “This is where, as end users, it’s important to set the right tone and culture.”
“A lot of people are protecting their interests,” said Westbrock. “We said, ‘Let’s not focus on assessing the blame, but understand where the breakdown occurred.’ James said, ‘There will be no blame assessed and no financial compensation sought.’”
That won’t always be easy, especially in events involving outages or injuries. But Westbrock said the importance of the issue calls for nothing less.
“There are going to be problems as we build these projects,” said Westbrock. “We’re looking at critical problems. In this industry, we all need to own up to this to ensure that we do better.”