In this week’s Voices of the Industry, Josh Moody, FORTRUST’s Senior Vice President of Sales and Marketing, explores the top causes of data center downtime.
With so much riding on the support data centers provide, downtime is a serious – and often newsworthy – occurrence. Just last year, several service providers made headlines due to data center outages, showing that even the biggest names in technology aren’t immune to downtime.
According to InformationWeek, Google’s Cloud suffered an outage in early 2015 due to software issues. Later that year, Apple customers experienced a seven-hour outage that impacted 11 platforms, including iCloud Mail, Find My iPhone and iCloud Drive.
Early 2016 saw a power outage in a Verizon data center, which effectively downed JetBlue’s digital infrastructure. While the airline’s service level agreement (SLA) with Verizon noted guaranteed failover, JetBlue’s website and digital airport systems remained unavailable for around three hours.
These episodes only scratch the surface of the damage an unscheduled outage can cause. Here at FORTRUST, we pride ourselves on ensuring availability – in 2015, we passed the 14-year mark for continuous critical systems uptime at our Denver data center.
With the impact of data center downtime becoming more significant, it raises the question – what are the top causes of downtime in the data center? Let’s take a look:
What Causes Data Center Downtime?
FORTRUST Senior Vice President and General Manager Robert McClary noted in the eBook “A Data Center Operations Guide for Maximum Reliability” that downtime can weigh heavily on a provider’s reputation.
“At the end of the day, the measuring stick for a data center is the number of unplanned downtime events and the track record of continuous uptime it has delivered,” McClary wrote.
Overall, McClary noted four main causes of unplanned data center downtime. These include:
1. Human error and poor infrastructure capacity management
2. Poor maintenance and lifecycle strategy
3. Substandard data center site selection and risk mitigation
4. All other causes
A 2013 study from the Ponemon Institute supports this. Overall, 83 percent of survey respondents were able to pinpoint the cause of an unplanned outage. Of these instances, 46 percent were caused by exceeding UPS capacity, 48 percent were the result of human error and 55 percent were due to UPS equipment failure. Respondents could cite more than one cause, which is why the figures total more than 100 percent.
Other causes of downtime can include utility outages, natural disasters, cyber attacks and a range of other problems that could interrupt regular data center processes.
Combating Top Causes of Downtime
Thankfully, there are factors clients can look for in providers that signal a commitment to uptime.
First and foremost, service providers must seek to combat the top cause of downtime: human error. One of the most effective ways to do this is with documented operational processes that lay out procedures across the entire facility. These can include procedures for standard operations, maintenance and contingency, as well as overall guidelines and methods. Each documented procedure should define the process taking place, and include detailed information about how staff members should carry out those activities. Templates can also be helpful to ensure uniformity and consistency.
However, it’s not enough to simply create a library of procedures. Data center managers must ensure that staff members are disciplined, and are following each process as it is defined in the documentation. Procedures must be adequately disseminated among employees, and there should be multiple opportunities for review and validation of each activity.
Training is also a critical part of preventing human error. Each staff member should have an in-depth understanding of all data center processes. In this way, should one staff member be absent, another employee is prepared to take over his or her role. It’s paramount that facility workers have the right skill set, and continual training is one of the best ways to establish knowledgeable, capable staff.
The Importance of DCIM
An effective data center infrastructure management (DCIM) strategy is also critical in preventing unplanned outages. FORTRUST has implemented a range of procedures as part of our processes, which include capacity monitoring and management, threshold alerts and alarms, real-time power usage and predictive maintenance.
“DCIM can play a key role in infrastructure capacity management by providing real-time information across the critical systems infrastructure,” McClary wrote. “It can assist in the creation of templates used to provision capacity to the end-user without infringing on redundancies or having unbalanced loads.”
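To make the idea of threshold alerts and redundancy-aware provisioning concrete, here is a minimal Python sketch. It is an illustration only, not FORTRUST’s DCIM tooling or any vendor’s API: the `UpsReading` type, the function names, the 80 percent warning threshold and the sample load figures are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class UpsReading:
    """One real-time reading for a UPS unit (hypothetical structure)."""
    name: str
    load_kw: float      # measured load on the unit
    capacity_kw: float  # rated capacity of the unit


def check_capacity(readings, warn_fraction=0.8):
    """Return an alert string for each unit at or above the warning threshold.

    The 0.8 default is an assumed threshold, not an industry rule.
    """
    alerts = []
    for r in readings:
        utilization = r.load_kw / r.capacity_kw
        if utilization >= warn_fraction:
            alerts.append(f"{r.name}: {utilization:.0%} of capacity")
    return alerts


def survives_single_failure(readings):
    """Check whether the total load still fits if the largest unit is lost.

    This is a simplified stand-in for "not infringing on redundancies".
    """
    total_load = sum(r.load_kw for r in readings)
    largest = max(r.capacity_kw for r in readings)
    remaining_capacity = sum(r.capacity_kw for r in readings) - largest
    return total_load <= remaining_capacity


# Made-up sample data for two UPS units.
readings = [
    UpsReading("UPS-A", load_kw=220, capacity_kw=500),
    UpsReading("UPS-B", load_kw=410, capacity_kw=500),
]
print(check_capacity(readings))           # alerts for units over the threshold
print(survives_single_failure(readings))  # whether one unit could carry the full load
```

With these sample numbers, UPS-B is flagged, and the combined load of 630 kW would not fit on a single 500 kW unit. That second check describes a facility that is running but has quietly lost its redundancy, which is the kind of condition real-time capacity monitoring is meant to surface before new load is provisioned.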
It’s also important that clients seek details about the provider’s maintenance and lifecycle strategies. An effective, comprehensive program here should include equipment and system inspections, predictive and preventative maintenance, as well as testing and corrective maintenance.
Finally, it’s critical that a client’s provider is not only aware of, but actively seeks to address, the top causes of unplanned outages. While it’s nearly impossible to predict the level of reliability within a provider’s data center, DCIM as well as robust maintenance and lifecycle strategies can play a significant role in ensuring uptime.
“The larger factors of high-availability and uptime are specific to people, process, operations, maintenance, lifecycle, risk mitigation strategies and the operation mindset that facilitates it,” McClary wrote.
Josh Moody is SVP of Sales and Marketing for FORTRUST. Josh joined FORTRUST in 2008. As Senior Vice President of Sales and Marketing, he is responsible for developing, managing and executing the overall sales strategy for FORTRUST. Josh brings more than 18 years of experience in the information technology field, including developing and executing strategic planning, management, customer relations, and vendor/partner relationships. Colorado Business Magazine identified Josh as one of the top 25 Most Powerful Sales Reps in Colorado in 2010. To find out more about FORTRUST’s track record for critical systems uptime and how its staff prevents unplanned outages, contact the company for a tour of its Denver data center.