Google Outage Sharpens Focus on Cloud Network Reliability

June 3, 2019
When Google has an outage, it’s often due to network problems, as was the case Sunday. A look at the growing role of the network in data center and cloud uptime.

Sunday’s major outage for the Google Cloud Platform disrupted YouTube and Gmail, as well as popular hosted services that rely on Google’s infrastructure, including Snapchat, Shopify and Pokemon Go.

Google said the downtime was caused by unusual network congestion in its operations in the Eastern U.S. In an incident report, Google said that YouTube measured a 10 percent drop in global views during the incident, while Google Cloud Storage measured a 30 percent reduction in traffic.

Google said the outage was caused by a configuration change that affected more servers than intended. “The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam,” the company said.
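The triage Google describes is a form of priority-based traffic shedding: when offered load exceeds available capacity, large latency-tolerant flows are dropped first so that small latency-sensitive ones get through. The sketch below is purely illustrative and is not Google's traffic-engineering code; the flow fields, names and capacity figures are hypothetical.

```python
# Illustrative only: shed large, latency-tolerant flows first under congestion.
from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    bytes_per_sec: int
    latency_sensitive: bool

def triage(flows: list[Flow], capacity_bps: int) -> list[Flow]:
    """Admit flows in priority order until capacity is exhausted."""
    # Latency-sensitive flows first, then smaller flows ahead of larger ones.
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.bytes_per_sec))
    admitted, used = [], 0
    for f in ordered:
        if used + f.bytes_per_sec <= capacity_bps:
            admitted.append(f)
            used += f.bytes_per_sec
    return admitted  # everything else is dropped or deferred

# Example: under congestion, the bulk storage transfer is shed first.
flows = [
    Flow("video-control", 1_000, True),
    Flow("search-query", 500, True),
    Flow("bulk-storage-sync", 900_000, False),
]
print([f.name for f in triage(flows, capacity_bps=100_000)])
```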

UPDATE: On June 18, Google suffered a widespread outage of Google Calendar, which lasted several hours.

Network problems in cloud platforms are complicated by their highly automated nature. These data traffic flows are designed to be large and fast and to work without human intervention – which makes them hard to tame when humans must step in.

“Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes,” the company reported. “Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage.”

Uptime is Not Simple or Easy

The outage serves as a reminder that service uptime remains a challenge for even the largest and most sophisticated Internet companies. Google has been a pioneer of “Internet-scale” computing technologies, defining many best practices in data center operations and site maintenance. Along with hyperscale cohorts like Microsoft and Amazon Web Services, Google has invested heavily in addressing most of the scenarios that cause outages.

When Google suffers downtime, it is usually because of problems with its network, as was the case Sunday. The event underscores how data networks are more important than ever in keeping popular services online.

Last month we noted how cloud computing is bringing change to how companies approach uptime, introducing architectures that create resiliency using software and network connectivity (See “Rethinking Redundancy”). This strategy, pioneered by cloud providers, is creating new ways of designing applications. Data center uptime has historically been achieved through layers of redundant electrical infrastructure, including uninterruptible power supply (UPS) systems and emergency backup generators.

Software, Network Take Center Stage

Cloud providers like Google have been leaders in creating failover scenarios that shift workloads across data centers, spreading applications and backup systems across multiple data centers, and using sophisticated software to detect outages and redirect data traffic to route around hardware failures and utility power outages. That includes the use of availability zones (AZs), clusters of data centers within a region that allow customers to run instances of an application in several isolated locations to avoid a single point of failure.
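The availability zone model boils down to running copies of an application in isolated locations and routing around the ones that fail. The minimal sketch below shows the idea; it is not any provider's actual load-balancing logic, and the zone names, addresses and health probe are hypothetical.

```python
# Simplified zone-aware failover: route to any healthy instance in any zone.
import random

ZONES = {
    "zone-a": ["10.0.1.10", "10.0.1.11"],
    "zone-b": ["10.0.2.10", "10.0.2.11"],
    "zone-c": ["10.0.3.10", "10.0.3.11"],
}

def health_check(instance: str) -> bool:
    """Placeholder probe; a real check would hit an HTTP health endpoint."""
    return random.random() > 0.1  # simulate occasional instance failures

def pick_backend() -> str:
    """Avoid a single point of failure by considering every zone's instances."""
    healthy = [i for instances in ZONES.values() for i in instances if health_check(i)]
    if not healthy:
        raise RuntimeError("no healthy backends in any zone")
    return random.choice(healthy)

print(pick_backend())
```

If an entire zone goes dark, traffic simply concentrates on the instances in the remaining zones, which is why applications are typically overprovisioned across zones.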

This strategy doesn’t completely solve the problem, but shifts it into a new framework, according to Andy Lawrence, executive director of research at the Uptime Institute.

“The data center sector is finding it challenging to manage complexity,” said Lawrence, who noted that cloud-style software-based resiliency “carries a higher risk of business services performance issues due to highly leveraged network and orchestration requirements.”

“Most organizations have hybrid infrastructure, with a computing platform that spans multiple cloud, co-location and enterprise environments,” Lawrence added. “This in turn, increases application and data access complexity. It’s an approach that has the potential to be very liberating – it offers greater agility and when deployed effectively, increased resiliency.”

The prominent role of the network in outages is seen in the Uptime Institute’s ninth annual Global Data Center Survey, which was released last week. Network failures were the second-most common source of outages, narrowly trailing on-site power outages.

Network failures are now the second-most common cause of data center failures, following on-site power failures. (Source: Uptime Institute)

That’s why network capabilities are a key focus in the cloud computing wars, with providers investing in dark fiber and subsea cables and introducing a steady stream of new features to help customers manage their networks. Facebook has developed custom hardware to manage the exploding volume of its internal network traffic, while external traffic volumes are likely to soar with the growth of machine-to-machine (M2M) data.

Our recent story on redundancy prompted a discussion on LinkedIn about the effectiveness of network-based resiliency strategies, including the question of whether they’re more effective than equipment redundancy, or just cheaper.

“We have heard for almost two decades that networks and now software successfully paper over the inevitable cracks in data center electrical and mechanical infrastructure,” said Chris Johnston, a data center consultant with lengthy uptime-focused executive experience at Syska Hennessy Group and EYP Mission Critical Facilities. “So far, these promises are unfulfilled. Hardly a week goes by that I don’t read about a major outage due to software or network problems. I await learning how these problems are solved.

“The cloud providers have all reduced the resiliency of their electrical and mechanical infrastructures for one reason – to cut capital and operating costs. My quibble is that network and software try to convince users that they can paper over the cracks, which they cannot consistently do. The electrical and cooling designers know that they cannot paper over the cracks, particularly with today’s trend toward reduced resiliency and budgets.”

Google and Amazon Web Services release public incident reports for major outages, which are often detailed and identify concrete plans to address weaknesses or failure points identified during the review. Outages also trigger credits for customers covered by service-level agreements (SLAs), which outline penalties when promised uptime standards are not met.
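To give a rough sense of what those uptime standards imply, the short calculation below converts monthly uptime targets into allowed downtime. The targets shown are generic examples, not any specific provider's SLA terms or credit tiers.

```python
# Back-of-the-envelope uptime math (illustrative; actual SLA terms vary).
def allowed_downtime_minutes(uptime_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month permitted under a given uptime target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)

for target in (99.9, 99.95, 99.99):
    print(f"{target}%: {allowed_downtime_minutes(target):.1f} minutes/month")
# 99.9% allows roughly 43 minutes a month; 99.99% allows only about 4.
```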

The Uptime Institute says that almost three-quarters of respondents to its recent survey say they are not placing their core mission-critical applications in a public cloud. A lack of visibility, transparency and accountability in public cloud services is a major issue for enterprises with mission-critical applications, according to the survey, in which 20 percent of all managers said they would be more likely to place workloads in a public cloud if they had more visibility into the provider’s operational resiliency.

“The vetting process for third-party services, such as cloud, should go beyond a contractual service level agreement, and should include greater dynamic visibility, direct support during service degradations, more accountability and a better cost recovery plan when problems occur,” Uptime said.

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
