The Cloud’s Worst Week: Will Outages Slow Enterprise Momentum?

Is cloud computing reliable? Can I trust Internet services with my mission-critical business workloads?

Those questions have always been central to the cloud conversation for enterprise IT operations. For some cloud prospects, last week’s cluster of outages for major Internet brands may raise fresh questions about the reliability of online service delivery.

Several of the Internet’s most widely used platforms suffered prolonged outages, including Google Cloud, Apple Pay and iCloud, Facebook, Instagram, WhatsApp and the Cloudflare content delivery service, whose outage affected thousands of customer web sites.

Last week’s incidents follow several outages in June, including downtime for Google and for Cloudflare (the latter caused by a routing protocol error at Verizon).

This may have been the cloud’s worst week, but the sky is not falling. The outages were unrelated, with different root causes, all of which are known challenges for data center and cloud services providers. But the cluster of incidents comes amid a period of rapid growth for cloud platforms, which are playing a more important role in the uptime of American business.

Some major service providers clearly have work to do on reliability, given that confidence remains a key selling point for mission-critical customers. The enterprise shift out of aging on-premises IT facilities has been a major factor in the growth of cloud and data center services. Analysts say the corporate sector is keen to get out of the data center business, citing the expense of maintaining their own facilities and a shortage of staff. But cloud platforms need to perform on all elements of their value proposition, which includes reliability as well as cost and agility.

The silver lining: New lessons are learned from every outage, and last week’s downtime will no doubt help several organizations sharpen their best practices.

Three Days of Downtime

Last week’s outages affected both cloud platforms and social networks. It’s fashionable to joke about downtime at social networks (“OMG, MY CAT PICTUREZ!!”). In reality, these platforms have become critical marketing channels for millions of small businesses and non-profits, as well as larger brands that are investing heavily in visual media.

On July 2, sites running on Cloudflare began returning 502 errors, caused by a massive spike in CPU utilization across Cloudflare’s global network of more than 140 data centers. “This CPU spike was caused by a bad software deploy that was rolled back,” reported Cloudflare CTO John Graham-Cumming, who said the problem was traced to a single misconfigured rule within the Cloudflare Web Application Firewall during a routine deployment. “We recognize that an incident like this is very painful for our customers,” Graham-Cumming added. “Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.”

Early on July 3, users of Facebook and Instagram reported that many images were not loading properly, and many had problems uploading new media. Facebook soon confirmed issues across its infrastructure. “Some people and businesses experienced trouble uploading or sending images, videos and other files on our apps and platforms,” Facebook said. The company shared no additional details. But the nature of the incident, in which status updates loaded but images did not, prompted speculation about networking issues with Facebook’s storage platforms.

Also on July 3, Google Cloud Platform reported latency problems in the Eastern US, which were traced to a fiber cut. “The disruptions with Google Cloud Networking and Load Balancing have been root caused to physical damage to multiple concurrent fiber bundles serving network paths in us-east1,” Google said. Local news reports suggest the issue was likely connected to a CenturyLink fiber cut near Charleston, South Carolina. Google operates a major East Coast cloud node at a campus in Goose Creek, S.C., about 20 miles north of Charleston.

On July 4, users of Apple Pay, iCloud and other Apple services reported downtime for more than 90 minutes. Internet analytics firm ThousandEyes said the “packet loss appears to have been caused by a BGP route flap issue, where a routing announcement is made and withdrawn in quick succession.” A major Internet downtime incident on June 24 was also attributed to issues with the Border Gateway Protocol, a key routing protocol implemented across major networks.
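
For readers unfamiliar with route flaps, the pattern described in that quote can be illustrated with a few lines of Python. This is only a rough sketch, not how any monitoring firm or router actually measures flapping: the event format, the function name count_flaps, the 60-second window and the example prefixes (taken from address ranges reserved for documentation) are all assumptions made for illustration.

```python
from collections import defaultdict


def count_flaps(events, window_seconds=60.0):
    """Count announce-then-withdraw pairs per prefix (hypothetical helper).

    events: iterable of (timestamp_seconds, prefix, kind) tuples, where
    kind is "announce" or "withdraw". A flap is counted when a withdraw
    arrives within window_seconds of the latest announce for that prefix.
    """
    last_announce = {}
    flaps = defaultdict(int)
    for ts, prefix, kind in sorted(events):
        if kind == "announce":
            last_announce[prefix] = ts
        elif kind == "withdraw":
            announced_at = last_announce.pop(prefix, None)
            if announced_at is not None and ts - announced_at <= window_seconds:
                flaps[prefix] += 1
    return dict(flaps)


# Made-up update log: the first prefix flaps twice within seconds,
# the second is withdrawn long after it was announced.
example_events = [
    (0, "192.0.2.0/24", "announce"),
    (5, "192.0.2.0/24", "withdraw"),
    (10, "192.0.2.0/24", "announce"),
    (12, "192.0.2.0/24", "withdraw"),
    (0, "198.51.100.0/24", "announce"),
    (500, "198.51.100.0/24", "withdraw"),
]
flap_counts = count_flaps(example_events)
```

In this made-up log only the first prefix is counted as flapping. Each flap forces neighboring routers to recompute their paths, which is why rapid repetition can show up to users as packet loss.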

The Cost of Downtime

We recently noted how cloud computing is changing how companies approach uptime, introducing architectures that create resiliency using software and network connectivity (See “Rethinking Redundancy”). This strategy, pioneered by cloud providers, is creating new ways of designing applications. Data center uptime has historically been achieved through layers of redundant electrical infrastructure, including uninterruptible power supply (UPS) systems and emergency backup generators.

Delivering uptime has always been the prime directive for data centers. The industry was created to ensure that mission-critical applications never go offline, but recent surveys show that many organizations continue to experience costly business interruptions.

A survey released in May by The Uptime Institute found that a third of respondents reported business impacts from infrastructure outages in the past year. More than 10 percent said their most recent outages had a measurable impact, citing more than $1 million in direct and indirect costs. But these calculations aren’t always simple.

“One of our findings is that organizations often don’t collect full financial data about the impact of outages, or if they do, it might take months for these to become apparent,” said Andy Lawrence. “Many of the costs are hidden, even if the outcry from managers and even non-paying customers is most certainly not. But cost is not a proxy for impact: even a relatively short and inexpensive outage at a big, consumer-facing service provider can attract negative, national headlines.”

Those headlines have played a role in the historic reluctance of many enterprises to place high-value applications in cloud environments, as seen in this slide from the Uptime report in May.

(Still) Courting The Enterprise

Enterprise users will be a big business driver once again in 2019. The shift to colo and cloud is not new or sexy, but it is a major driver of data center growth, along with new technology adoption. The cloud and colocation industries have reached a level of maturity that offers compelling value, breaking down the historic resistance to moving data offsite. The notion that in-house data centers are more secure than the cloud has been punctured by a series of corporate data compromises.

The Internet plays a vital role in business productivity. Downtime incidents tread on the narratives driving the digital transformation. It’s safe to say that the percentages in that Uptime chart will shift over time.

An accelerating factor in the cloud migration is the aging of on-premises facilities. Construction of these capital-intensive projects slowed dramatically after the financial crisis of 2008, and a growing number of companies are facing decisions about their infrastructure. The cost of building and maintaining data centers will likely continue to nudge corporations into the cloud for some time to come.