Inside Amazon’s Cloud Computing Infrastructure

Sept. 23, 2015
Amazon’s cloud computing platform has become critical to many of its more than 1 million customers. We look at how Amazon builds its data centers, and the scale and design of its infrastructure.

As cloud computing has emerged as the new paradigm for computing at scale, Amazon has firmly established itself as the dominant player. After effectively creating the public cloud market with the launch of Amazon Web Services in 2006, the retailer has built AWS into a $6 billion a year business.

Along the way, Amazon’s infrastructure has become critical to the uptime of more than 1 million customers. That’s why an outage at Amazon can create ripples across popular sites like Netflix, Reddit, Tinder and IMDb, as happened on Sunday when Amazon experienced problems at a data center in Virginia.

This week we’ll look at Amazon’s mighty cloud infrastructure, including how it builds its data centers and where they live (and why).

Amazon operates at least 30 data centers in its global network, with another 10 to 15 on the drawing board. Amazon doesn’t disclose the full scope of its infrastructure, but third-party estimates peg its U.S. data center network at about 600 megawatts of IT capacity.

Leading analysts view Amazon as the dominant player in public cloud. “AWS is the overwhelming (cloud computing) market share leader, with more than five times the compute capacity in use than the aggregate total of the other 14 providers,” writes IT research firm Gartner in its assessment of the cloud landscape.

Lifting the Veil of Secrecy … A Bit

Amazon has historically been secretive about its data center operations, disclosing far less about its infrastructure than other hyperscale computing leaders such as Google, Facebook and Microsoft. That has begun to change in the last several years, as Amazon executives Werner Vogels and James Hamilton have opened up about the company’s data center operations at events for the developer community.

“There’s been quite a few requests from customers asking us to talk a bit about the physical layout of our data centers,” said Werner Vogels, VP and Chief Technology Officer for Amazon, in a presentation at the AWS Summit Tel Aviv in July. “We never talk that much about it. So we wanted to lift up the secrecy around our networking and data centers.”

A key goal of these sessions is to help developers understand Amazon’s philosophy on redundancy and uptime. The company organizes its infrastructure into 11 regions, each containing a cluster of data centers. Each region contains multiple Availability Zones, providing customers with the option to mirror or back up key IT assets to avoid downtime. The “ripple effect” of outages whenever AWS experiences problems indicates that this feature remains underutilized.

Scale Drives Platform Investment

In its most recent quarter, the revenue for Amazon Web Services was growing at an 81 percent annual rate. That may not translate directly into a similar rate of infrastructure growth, but one thing is certain: Amazon is adding servers, storage and new data centers at an insane pace.

“Every day, Amazon adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7 billion annual revenue enterprise,” said James Hamilton, Distinguished Engineer at Amazon, who described the AWS infrastructure at the Re:Invent conference last fall. “There’s a lot of scale. That volume allows us to reinvest deeply into the platform and keep innovating.”

Amazon’s data center strategy is relentlessly focused on reducing cost, according to Vogels, who noted that the company has reduced prices 49 times since launching Amazon Web Services in 2006.

Amazon CTO Werner Vogels (Image: YouTube)

“We do a lot of infrastructure innovation in our data centers to drive cost down,” Vogels said. “We see this as a high-volume, low-margin business, and we’re more than happy to keep the margins where they are. And then if we have a lower cost base, we’ll hand money back to you.”

A key decision in planning and deploying cloud capacity is how large a data center to build. Amazon’s huge scale offers advantages in both cost and operations. Hamilton said most Amazon data centers house between 50,000 and 80,000 servers, with a power capacity of between 25 and 30 megawatts.

“In our view, this is around the right number, and we’ve chosen to build this number for an awfully long time,” said Hamilton. “We can build bigger. The thing is, the early advantages of scale are huge, but there’s a point where these advantages go down. A really huge data center is only marginally less expensive per rack than a medium-sized data center.”
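Hamilton’s sizing figures imply a rough power budget per machine. A quick back-of-the-envelope check (the 50,000-to-80,000-server and 25-to-30-megawatt figures are from the article; everything else is plain arithmetic, and the result is facility watts per server, not the IT draw of a single box):

```python
# Back-of-the-envelope: implied facility power per server from Hamilton's
# figures of 25-30 MW across 50,000-80,000 servers. The facility number
# includes cooling and power-distribution overhead, not just server draw.

def watts_per_server(total_mw: float, servers: int) -> float:
    """Facility power divided evenly across the server count."""
    return total_mw * 1_000_000 / servers

low = watts_per_server(25, 80_000)   # densest case: 312.5 W per server
high = watts_per_server(30, 50_000)  # sparsest case: 600.0 W per server

print(f"{low:.1f} to {high:.1f} facility watts per server")
```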

How Big is Too Big?

As data centers get bigger, each one represents a larger share of risk within the company’s network.

“It’s undesirable to have data centers that are larger than that due to what we call the ‘blast radius’,” said Vogels, noting the industry term for assessing risk based on a single destructive regional event. “A data center is still a unit of failure. The larger you build your data centers, the larger the impact such a failure could have. We really like to keep the size of data centers to less than 100,000 servers per data center.”

So how many servers does Amazon Web Services run? The descriptions by Hamilton and Vogels suggest the number is at least 1.5 million. Figuring out the upper end of the range is more difficult, but it could reach as high as 5.6 million, according to calculations by Timothy Prickett Morgan at The Platform.
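The lower bound follows directly from figures already quoted: at least 30 data centers in the global network, each at the bottom of Hamilton’s 50,000-to-80,000-server range. A sketch of that arithmetic (treating 30 facilities at 50,000 servers each as the assumed floor):

```python
# Lower-bound server estimate from figures quoted in the article:
# "at least 30 data centers in its global network", each housing
# at least 50,000 servers (the low end of Hamilton's range).
data_centers = 30
servers_per_dc = 50_000

lower_bound = data_centers * servers_per_dc
print(f"At least {lower_bound:,} servers")  # At least 1,500,000 servers
```

The upper end is murkier because it depends on how many facilities sit at the 80,000-server ceiling and how many data centers each Availability Zone actually contains.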

Amazon leases buildings from a number of wholesale data center providers, including Digital Realty Trust and Corporate Office Properties Trust. In the past the company typically leased existing properties such as warehouses and then renovated them for data center use. In recent years Amazon has begun focusing on new construction, which provides a “greenfield” that can be customized to support all elements of its designs, from the grid to the server. In Oregon, Amazon has used pre-fabricated “modular” data center components to accelerate its expansion.

An interesting element of Amazon’s approach to data center development is that it has the ability to design and build its own power substations. That specialization is driven by the need for speed, rather than cost management.

“You save a tiny amount,” said Hamilton. “What’s useful is that we can build them much more quickly. Our growth rate is not a normal rate for utility companies. We did this because we had to. But it’s cool that we can do it.”

Custom Servers and Storage

In the early days of its cloud platform, Amazon bought its servers from leading vendors. One of its major providers was Rackable Systems, an early player in innovative cloud-scale server designs. Amazon bought $86 million in servers from Rackable in 2008, up from $56 million a year earlier.

But as its operations grew, Amazon followed the lead of Google and began creating custom hardware for its data centers. This allows Amazon to fine-tune its servers, storage and networking gear to get the best bang for its buck, offering greater control over both performance and cost.

“Yes, we build our own servers,” said Vogels. “We could buy off the shelf, but they’re very expensive and very general purpose. So we’re building custom storage and servers to address these workloads. We’ve worked together with Intel to make custom processors available that run at much higher clockrates. It allows us to build custom server types to support very specific workloads.”

(Image: James Hamilton, Amazon Web Services)

Amazon offers several EC2 instance types featuring these custom chips, a souped-up version of the Xeon E5 processor based on Intel’s Haswell architecture and 22-nanometer process technology. Different configurations offer optimizations for compute-intensive, memory-intensive or IOPS-intensive applications.

“We know how to build servers to a certain specification, and as a consequence the processors can be pushed harder,” said Hamilton.

AWS designs its own software and hardware for its networking, which is perhaps the most challenging component of its infrastructure. Vogels said servers still account for the bulk of data center spending, but while servers and storage are getting cheaper, the cost of networking has gone up.

The Speed of Light Versus the Cloud

The “speed of light factor” in networking plays a significant role in how Amazon designs its infrastructure.

“The way most customers work is that an application runs in a single data center, and you work as hard as you can to make the data center as reliable as you can, and in the end you realize that about three nines (99.9 percent uptime) is all you’re going to get,” said Hamilton. “As soon as you get a high-reliability app, you run it in two data centers. Usually they’re a long way apart, so the return trip is very long. It’s excellent protection against a rare problem.”
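The payoff Hamilton describes is easy to quantify: if each data center independently achieves three nines, running the application in two of them multiplies the failure probabilities together. A quick sketch, assuming failures are independent (which is the whole point of geographic separation):

```python
# Availability math for running an app across independent data centers.
# Assumes site failures are independent -- the reason they are far apart.

def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` identical, independent
    sites is up, given each is up with probability `per_site`."""
    return 1 - (1 - per_site) ** sites

one_dc = combined_availability(0.999, 1)  # three nines
two_dc = combined_availability(0.999, 2)  # six nines
print(f"one site: {one_dc}, two sites: {two_dc:.6f}")
```

Two three-nines sites yield roughly 99.9999 percent availability on paper, which is why even a long round trip between them is, as Hamilton puts it, “excellent protection against a rare problem.”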

“Building distributed development across multiple data centers, especially if they’re geographically further away, becomes really hard,” said Vogels.

The answer was Availability Zones: clusters of data centers within a region that allow customers to run instances in several isolated locations to avoid a single point of failure. If customers distribute instances and data across multiple Availability Zones (AZs) and one instance fails, the application can be designed so that an instance in another Availability Zone can handle requests. Each region has between two and six Availability Zones.
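The failover pattern the Availability Zone design enables can be sketched in a few lines. This is purely illustrative: the zone names follow AWS conventions, but the routing and health-check logic here are hypothetical, not the AWS API.

```python
# Illustrative failover across Availability Zones. The zone names mimic
# AWS conventions, but the health-check and routing logic are a made-up
# sketch of the pattern, not any real AWS interface.

def route_request(az_health: dict[str, bool], preferred: str) -> str:
    """Send the request to the preferred AZ if it is healthy,
    otherwise fail over to any other healthy AZ."""
    if az_health.get(preferred):
        return preferred
    for az, healthy in az_health.items():
        if healthy:
            return az
    raise RuntimeError("no healthy Availability Zone")

zones = {"us-east-1a": False, "us-east-1b": True}   # 1a has failed
print(route_request(zones, preferred="us-east-1a"))  # -> us-east-1b
```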

Amazon initially said little about the physical layout of AZs, leaving some customers to wonder whether they might be different data halls within the same facility. The company has since clarified that each availability zone resides in a different building.

“It’s a different data center,” said Hamilton. “We don’t want to run out of AZs, so we add data centers.”

To make this work, the Availability Zones need to be isolated from one another, but close enough for low-latency network connections. Amazon says its zones are typically 1 to 2 milliseconds apart, compared to the 70 milliseconds required to move traffic from New York to Los Angeles.
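Those latency figures map directly onto distance. Light in optical fiber covers roughly 200 kilometers per millisecond (a standard rule of thumb, not a figure from the article), and a round trip doubles the path, so the implied separations work out like this:

```python
# Rough distance implied by a round-trip time, using the rule of thumb
# that light in optical fiber travels at ~200,000 km/s (about 2/3 of c).
FIBER_KM_PER_MS = 200.0  # one-way kilometers per millisecond in fiber

def max_separation_km(rtt_ms: float) -> float:
    """Upper bound on one-way fiber distance for a given round-trip time."""
    return rtt_ms / 2 * FIBER_KM_PER_MS

print(max_separation_km(2))   # AZs ~2 ms apart: at most ~200 km
print(max_separation_km(70))  # a 70 ms NY-to-LA round trip would allow up
                              # to ~7,000 km of fiber; the real path is
                              # shorter, with routing and equipment delay
                              # making up the rest of the budget
```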

“We’ve decided to place AZs relatively close together,” said Vogels. “However, they need to be in a different flood zone and a different geographical area, connected to different power grids, to make sure they are truly isolated from one another.”

NEXT: We look at the geography of Amazon’s cloud computing data center network.

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
