Uptime: Networks, Software Play Growing Role in Data Center Outages

Networking and software issues are emerging as two of the more common causes of data center outages, while power problems are becoming somewhat less of an issue, according to new data from Uptime Institute’s Annual Outage Analysis.

This trend is not surprising, given the growing role of cloud computing and SaaS (software as a service) applications, which increasingly use architectures that can route around physical failures of electrical components like UPS systems, transfer switches and generators. To be sure, power chain issues still have a major impact on downtime, as seen in several headline-grabbing incidents in early 2021.

“Overall, the causes of outages are changing,” said Andy Lawrence, executive director of research, Uptime Institute. “Software and IT configuration issues are becoming more common, while power issues are now less likely to cause a major IT service outage.”

Online services were more important than ever in 2020, as the COVID-19 pandemic and social distancing practices boosted remote work and learning. That meant that service outages were more broadly felt, and generated wider notice.

“Although there were significant disruptions affecting financial trading, government services, internet and telecom, the outages that made headlines in 2020 were often about the impact to consumers and workers at home, with interruptions to applications such as Microsoft Exchange and Teams, Zoom, fitness trackers and the like,” Uptime noted.

Outages Create More Concern, More Cost

The cost of data center outages goes beyond headlines and user complaints on social media. More than half of the respondents who reported an outage to Uptime in the past three years estimated its cost at more than $100,000, and almost a third reported costs of $1 million or above.

“Resiliency remains near the top of management priorities when delivering business services,” said Lawrence. “The fact is outages remain common and justify the increased concern and investment in preventing them. Because of the disruption and high costs that result from disrupted IT services, identifying and analyzing the root causes of failures is a critical step in avoiding more expensive problems.”

Findings from Uptime’s 2020 survey include:

Almost half (44%) of data center operators surveyed think that concern about resiliency of data center/mission-critical IT has increased in the past twelve months.
Serious and severe outages are less common (one in six reported having one in the past three years) but can have catastrophic results for stakeholders. Vigilance and investment are necessary.
More than half (56%) of all organizations using a third-party data service have experienced a moderate or serious IT service outage in the last three years that was itself caused by the provider.

As Architecture Shifts, So Do Outage Culprits

The focus on third-party service performance accompanies the ongoing shift from on-premises data centers to colocation facilities and cloud platforms. That shift has a positive impact on uptime, but it also amplifies any failures in the networks and software automation that drive the cloud delivery model. (For more on this shift, see our DCF article “Rethinking Redundancy: Is Culture Part of the Problem.”)

“This rise in outages caused by IT systems and network issues is due to the broad shift in recent years from siloed IT services running on dedicated, specialized equipment to an architecture in which more IT functions run on standard IT systems, often distributed or replicated across many sites,” Uptime says in its outage report. “As more organizations move to cloud-based, distributed IT (driven by a desire for greater agility and automation), the underlying data center infrastructure is becoming less of a focus or a single point of failure.

“This does not mean, however, that there is any case, at least at present, for de-emphasizing site-level resiliency or investing less,” Uptime added. “Site-level failures invariably cause major problems, regardless of whether distributed resiliency architectures are deployed.”

As the originator of the Tier System, which has long been used as a benchmark for reliability design, Uptime has an ongoing interest in equipment redundancy, a key focus for the tier ratings. As recent events have shown, power equipment continues to be central to uptime.

In March, an OVH data center in Strasbourg, France, was destroyed by a fire. While no final analysis has been provided, early indications point to UPS units as the likely origin of the incident.
This month, an emergency generator caught fire at a WebNX data center in Ogden, Utah, causing the full shutdown of the data center and lengthy outages for customers.