Is A Data Center Disaster Inevitable? Lessons from the Nuclear Industry
PHOENIX – The data center industry needs to rethink its approach to risk, or it faces the potential of a headline-making outage that could bring regulation, according to a leading reliability expert.
Steve Fairfax, the President of MTechnology, said there are key lessons to be learned from the experience of another mission-critical business – the nuclear power industry. Fairfax was among the keynote speakers at last week’s 7×24 Exchange Fall Conference at the Marriott Desert Ridge in Phoenix, sharing a presentation titled “Three Mile Island, Chernobyl and Fukushima: Lessons for the Data Center Industry.”
“Our society is more dependent than ever on functioning data centers,” said Fairfax. “There have already been fatalities from disruptions to 911 and hospital systems. There are a lot more ways this could happen.”
The disasters in the nuclear industry provide a cautionary tale, and many of the miscues that led to the three meltdown incidents have also been seen in recent data center failures, said Fairfax, who does reliability consulting for both industries.
Fairfax called upon the data center industry to take this risk seriously, and work harder to address weaknesses in planning and operational practices that could lead to safety incidents. The downside of a high-profile disaster could include a regulatory response that would impact the entire industry, not just individual providers.
“Data centers are largely invisible to the public,” said Fairfax. “When the public learns that people have been hurt or killed by a hazard they’ve never heard about, they get angry.
“You’d be surprised what can happen when there is public outrage and politicians get involved,” Fairfax continued. “This is a lightly regulated industry. When there is a disaster and intense public outrage, you do not get judiciously applied regulations.”
More Mission-Critical Than Ever
Data centers have always been mission-critical facilities, housing IT systems that support 911 systems, police and fire communications, patient information for hospitals, and critical government infrastructure. And like it or not, social platforms like Facebook and Twitter have become important tools for public communication during emergencies.
Major concentrations of compute capacity in geographic clusters also present risk, according to Fairfax, creating management challenges for America’s already strained electrical system. “Our grid infrastructure is already subject to multiple stresses,” he said.
Fairfax isn’t the only one raising these concerns. The Infrastructure Masons have also noted the rising criticality of digital infrastructure, and the likelihood of increased scrutiny.
That’s why Fairfax sees important parallels between the nuclear industry and data centers. In his 7×24 presentation, Fairfax examined details of the disasters at Three Mile Island, Chernobyl and Fukushima, and compared them with three real but anonymized data center failures that followed familiar patterns.
Operational Factors in Failures
Preventive Maintenance
Preventive maintenance was implicated in a number of the facility failures. Fairfax has been a frequent critic of preventive maintenance contracts for data centers, which he says are often structured around the needs of the vendor, rather than the equipment. MTechnology has done extensive research on maintenance schedules, and asserts that too much maintenance can be disruptive to optimal configurations for reliable operations.
An example: During the Three Mile Island nuclear meltdown in 1979, a secondary feed water pump was disabled by manual valves closed during preventive maintenance.
Fairfax says each maintenance window creates three opportunities for introducing instability – when the system is taken out of service, during the maintenance operation, and when the system is placed back in service.
“There’s a lot of money to be made in data center maintenance,” said Fairfax. “There’s a powerful incentive to do too much. We have clients that are reducing the frequency and scope of preventive maintenance.”
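To make the arithmetic behind that argument concrete, here is a minimal sketch. It assumes, purely for illustration, that each of the three steps in a maintenance window carries the same small, independent chance of introducing a fault. The function name, the per-step probability and the independence assumption are hypothetical and are not taken from MTechnology's research.

```python
def p_at_least_one_fault(windows: int, p_per_step: float, steps_per_window: int = 3) -> float:
    """Chance that at least one fault is introduced across all maintenance windows.

    Assumes every step (take out of service, perform maintenance, return to
    service) fails independently with the same probability. This is a
    simplifying assumption for illustration only.
    """
    opportunities = windows * steps_per_window
    return 1.0 - (1.0 - p_per_step) ** opportunities


# Hypothetical comparison: monthly versus quarterly maintenance over one year,
# with an assumed 1 percent chance of a fault at each step.
monthly = p_at_least_one_fault(windows=12, p_per_step=0.01)
quarterly = p_at_least_one_fault(windows=4, p_per_step=0.01)
print(f"monthly: {monthly:.3f}, quarterly: {quarterly:.3f}")
```

Under these assumptions, fewer windows means fewer opportunities for error. The sketch ignores the faults that maintenance prevents, so it illustrates only one side of the trade-off.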
Poor Planning
The nuclear disasters also underscored the importance of planning scenarios. This was a particular failure in the Fukushima nuclear disaster, as the waterfront facility was smashed by a 46-foot-high tsunami from a magnitude 9 earthquake. The huge wave swamped the facility’s emergency backup generators, crippling the plant’s ability to provide cooling for the nuclear cores. The resulting release of radiation forced an emergency evacuation of 154,000 local residents.
“Everyone knew the earthquake and tsunami risk,” said Fairfax. “And yet the owners and operators did not prepare properly.”
A commission appointed by Japan’s parliament concluded in 2012 that Fukushima “was a profoundly man-made disaster – that could and should have been foreseen and prevented, (while) its effects could have been mitigated by a more effective human response.”
In designing the Fukushima facility, facility operator Tokyo Electric Power (TEPCO) looked only at earthquake and tsunami impact on their own facilities, not the public health risks. Fairfax said this approach holds lessons for the data center industry.
“Our current planning focuses on the risk to the data center itself,” said Fairfax. “There’s not a lot of work on mitigating risk for other parties. Risk is not just probability. It’s probability times consequence.”
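The "probability times consequence" idea can be shown with a small sketch. The scenario names, probabilities and cost figures below are invented for illustration and do not come from Fairfax's presentation. The point is only that a rare event can dominate the ranking once consequences for third parties are counted.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # assumed chance of the event in a given year
    facility_cost: float       # assumed loss to the data center itself
    third_party_cost: float    # assumed loss to outside parties that depend on it


def expected_risk(s: Scenario, include_third_parties: bool = True) -> float:
    """Risk as probability times consequence, in the same units as the costs."""
    consequence = s.facility_cost
    if include_third_parties:
        consequence += s.third_party_cost
    return s.annual_probability * consequence


# Hypothetical scenarios with made-up numbers.
minor = Scenario("frequent minor outage", 0.5, 10.0, 0.0)
severe = Scenario("rare outage affecting emergency services", 0.01, 100.0, 10_000.0)

for s in (minor, severe):
    print(s.name, expected_risk(s, include_third_parties=False), expected_risk(s))
```

Judged by facility losses alone, the frequent minor outage carries the higher risk in this example. Once third-party consequences are included, the rare event ranks far higher.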
Not Understanding the State of the Facility
Another challenge shared by all of the data center and nuclear plant failures was a lack of visibility into the true condition of the facility.
“Operators were unaware of the true state of the systems,” said Fairfax. “When something goes wrong, you can always make it worse.”
This was a particular problem at Chernobyl, where an accident occurred during a test to simulate a power outage. The test was delayed, and the night shift was not fully trained for the full test protocols. Proper procedures were not followed, creating unstable operating conditions. The testing continued, despite unexpected readings and improvised procedures to boost power using different configurations of the control rods.
The resulting explosion killed two plant employees, with 28 more dying of acute radiation poisoning within weeks. Large parts of Ukraine were evacuated, and the cleanup cost $68 billion.
Fairfax also described a data center failure in which circuit breakers failed to coordinate properly. Power was restored in 11 minutes, but it took longer to get customer systems back online. The facility operators were beset by irate customers who threatened to cancel contracts and demanded immediate restoration of the system to its “normal” configuration, Fairfax said.
The operators attempted switching systems back on five times, and had five unexpected outcomes.
“Once again, operators were unaware of the true state of the system,” said Fairfax. “Operator actions then contributed to the failure and its consequences.
“If you attempt something and it doesn’t work, that means the operator’s mental image of the facility is different from the actual state of the facility,” he said. “In these circumstances, the odds of making it better are close to zero, and the odds of making it worse are close to 100 percent.”
Responses to the Risk Challenge
Although staff errors contributed to some of these failures, Fairfax says the most important failures were systemic.
“Every one of these disasters was preventable, made worse by lack of preparation, and made worse by failing to understand the state of the system,” he said. “I’m not talking about individual people making mistakes, I’m talking about systems that made it impossible for them to do their job properly.”
Fairfax challenged attendees at 7×24 – which included nearly 900 engineers and executives working in the data center field – to continue to deeply analyze risk and the consequences of failure. He offered several avenues for potential improvements.
One is technology. Fairfax noted that despite advances in virtual reality and 3D modeling, there is still no effective “full-on data center simulator.” The development of such a system could play a role in designing safer systems, bringing new depth to analyses of failure scenarios, system dependencies and ripple effects.
He also highlighted the role of the Institute of Nuclear Power Operations (INPO), which was created in the wake of the Three Mile Island disaster and works to promote industry-wide safety and reliability excellence. It serves as a hub for training, accreditation and plant evaluations. Fairfax said that INPO helped change the nuclear industry, creating a culture in which reporting an anomaly is encouraged and required.
“This program has been very successful,” said Fairfax.
He closed with a prediction that the data center industry will have stronger safety and reliability performance, but reminded the audience that they will determine how that goal is realized.
“Change will come,” said Fairfax. “If we continue on the path we are currently on (there will be failures that cost lives), there will be an immense public outcry and we will be driven to change. Or we can take action and head this off.”