Google Outage Sharpens Focus on Cloud Network Reliability

When Google has an outage, it’s often due to network problems, as was the case Sunday. A look at the growing role of the network in data center and cloud uptime.

Rich Miller

June 3, 2019

A Google Jupiter network switch. (Image: Google)

Sunday’s major outage for the Google Cloud Platform disrupted YouTube and Gmail, as well as popular hosted services that rely on Google’s infrastructure, including Snapchat, Shopify and Pokemon Go.

Google said the downtime was caused by unusual network congestion in its operations in the Eastern U.S. In an incident report, Google said that YouTube measured a 10 percent drop in global views during the incident, while Google Cloud Storage measured a 30 percent reduction in traffic.

Google said the outage was caused by a configuration change that affected more servers than intended. “The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam,” the company said.
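The triage behavior Google describes can be illustrated with a minimal sketch. This is not Google's actual mechanism: the `Flow` class, the `triage` function, the flow names and the capacity figure are all hypothetical, chosen only to show how a congested link might favor small, latency-sensitive flows and drop large bulk transfers first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    name: str                # hypothetical label for the traffic flow
    demand_mbps: float       # bandwidth the flow is asking for
    latency_sensitive: bool  # True for small, urgent traffic


def triage(flows, capacity_mbps):
    """Admit flows until capacity runs out.

    Latency-sensitive flows are considered first, and within each class
    smaller flows go ahead of larger ones. Returns (admitted, dropped).
    """
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.demand_mbps))
    admitted, dropped = [], []
    remaining = capacity_mbps
    for flow in ordered:
        if flow.demand_mbps <= remaining:
            admitted.append(flow)
            remaining -= flow.demand_mbps
        else:
            dropped.append(flow)
    return admitted, dropped


# Illustrative traffic mix on a congested link (all values are assumptions).
flows = [
    Flow("bulk-backup", 800.0, False),
    Flow("rpc-control", 5.0, True),
    Flow("video-upload", 300.0, False),
    Flow("dns-lookup", 1.0, True),
]
admitted, dropped = triage(flows, capacity_mbps=400.0)
print("admitted:", [f.name for f in admitted])
print("dropped:", [f.name for f in dropped])
```

In this toy example the small control and lookup traffic gets through, while the largest bulk transfer is the one that gets dropped. Production networks implement this kind of prioritization in switch and router queues rather than in application code.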

UPDATE: On June 18 Google suffered a widespread outage for Google Calendar, which lasted several hours.

Network problems in cloud platforms are complicated by the platforms' highly automated nature. Their data traffic flows are designed to be large and fast and to work without human intervention, which makes them hard to tame when humans must step in.

“Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes,” the company reported. “Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage.”

Uptime is Not Simple or Easy

The outage serves as a reminder that service uptime remains a challenge for even the largest and most sophisticated Internet companies. Google has been a pioneer of “Internet-scale” computing technologies, defining many best practices in data center operations and site maintenance. Along with hyperscale cohorts like Microsoft and Amazon Web Services, Google has invested heavily in addressing most of the scenarios that cause outages.

When Google suffers downtime, it is usually because of problems with its network, as was the case Sunday. The event underscores how data networks are more important than ever in keeping popular services online.

Last month we noted how cloud computing is bringing change to how companies approach uptime, introducing architectures that create resiliency using software and network connectivity (See “Rethinking Redundancy”). This strategy, pioneered by cloud providers, is creating new ways of designing applications. Data center uptime has historically been achieved through layers of redundant electrical infrastructure, including uninterruptible power supply (UPS) systems and emergency backup generators.

Software, Network Take Center Stage

Cloud providers like Google have been leaders in creating failover scenarios that shift workloads across data centers, spreading applications and backup systems across multiple data centers, and using sophisticated software to detect outages and redirect data traffic to route around hardware failures and utility power outages. That includes the use of availability zones (AZs), clusters of data centers within a region that allow customers to run instances of an application in several isolated locations to avoid a single point of failure.
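The idea of routing around a failed location can be sketched in a few lines. This is a simplified illustration, not any provider's API: the zone names, the `zone_health` map and the `route` function are hypothetical, and real platforms rely on load balancers and health checks rather than a lookup table.

```python
import itertools


def healthy_zones(zone_health):
    """Return the zones currently reporting healthy, in their listed order."""
    return [zone for zone, ok in zone_health.items() if ok]


def route(requests, zone_health):
    """Spread requests round-robin across healthy zones only.

    Raises RuntimeError if every zone is down, which is the case that
    zone-level redundancy alone cannot cover.
    """
    zones = healthy_zones(zone_health)
    if not zones:
        raise RuntimeError("no healthy zone available")
    rotation = itertools.cycle(zones)
    return {request: next(rotation) for request in requests}


# Hypothetical region with three zones, one of which has failed.
zone_health = {"zone-a": True, "zone-b": False, "zone-c": True}
assignments = route(["req-1", "req-2", "req-3"], zone_health)
print(assignments)
```

The failed zone receives no traffic, and the remaining zones absorb the load. The sketch also shows the dependency discussed in this article: the rerouting itself only works if the network and the control software that tracks zone health are functioning.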

This strategy doesn’t completely solve the problem, but shifts it into a new framework, according to Andy Lawrence, executive director of research, Uptime Institute.

“The data center sector is finding it challenging to manage complexity,” said Lawrence, who noted that cloud-style software-based resiliency “carries a higher risk of business services performance issues due to highly leveraged network and orchestration requirements.”

“Most organizations have hybrid infrastructure, with a computing platform that spans multiple cloud, co-location and enterprise environments,” Lawrence added. “This, in turn, increases application and data access complexity. It’s an approach that has the potential to be very liberating – it offers greater agility and, when deployed effectively, increased resiliency.”

The prominent role of the network in outages is seen in the Uptime Institute’s ninth annual Global Data Center Survey, which was released last week. Network failures were the second-most common source of outages, narrowly trailing on-site power outages.

Network failures are now the second-most common cause of data center failures, following on-site power failures. (Source: Uptime Institute)

That’s why network capabilities are a key focus in the cloud computing wars, with providers investing in dark fiber and subsea cables as well as introducing a steady stream of new features to help customers manage their networks. Facebook has developed custom hardware to manage the exploding volume of its internal network traffic, while external data tonnage is likely to soar with the growth of M2M data (machine-to-machine activity).

Our recent story on redundancy prompted a discussion on LinkedIn on the effectiveness of network-based resiliency strategies, including the question of whether they’re more effective than equipment redundancy, or just cheaper.

“We have heard for almost two decades that networks and now software successfully paper over the inevitable cracks in data center electrical and mechanical infrastructure,” said Chris Johnston, a data center consultant with lengthy uptime-focused executive experience at Syska Hennessy Group and EYP Mission Critical Facilities. “So far, these promises are unfulfilled. Hardly a week goes by that I don’t read about a major outage due to software or network problems. I await learning how these problems are solved.

“The cloud providers have all reduced the resiliency of their electrical and mechanical infrastructures for one reason – to cut capital and operating costs. My quibble is that network and software try to convince users that they can paper over the cracks, which they cannot consistently do. The electrical and cooling designers know that they cannot paper over the cracks, particularly with today’s trend toward reduced resiliency and budgets.”

Google and Amazon Web Services release public incident reports for major outages, which are often detailed and identify concrete plans to address weaknesses or failure points found during the review. Outages can also trigger credits for customers with service-level agreements (SLAs), which outline penalties for failure to meet promised uptime standards.

The Uptime Institute says that almost three-quarters of respondents to its recent survey say they are not placing their core mission-critical applications into a public cloud. A lack of visibility, transparency and accountability of public cloud services is a major issue for enterprises with mission-critical applications, according to the survey, in which 20 percent of all managers said they would be more likely to place workloads in a public cloud if they had more visibility into the provider’s operational resiliency.

“The vetting process for third-party services, such as cloud, should go beyond a contractual service level agreement, and should include greater dynamic visibility, direct support during service degradations, more accountability and a better cost recovery plan when problems occur,” Uptime said.

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
