Designing for Density: AI Brings Denser Racks into Multi-Tenant Data Centers

The growing adoption of artificial intelligence (AI) and other high-density applications poses challenges for data center design and management. Average rack densities have been rising, but the trend is driven by pockets of powerful hardware that often must co-exist with standard workloads within an organization’s IT operations.

“We’ve seen a lot of demand for those AI and machine learning applications,” said Ben Coughlin, CEO at Colovore, a high-density hosting specialist in Santa Clara, Calif. “In the last 12 months we’ve seen a big pickup. Customers want more intelligence, which has led to a lot of innovation at the server level.”

The shift to non-CPU specialized computing will bring change in the data center industry, which must adapt to new form factors and higher rack densities. A key architectural challenge is positioning data-crunching AI racks to access large datasets in storage environments which may have very different cooling needs.

Coughlin says more companies are seeking to divide colocation workloads and requirements. “I think it’s becoming more common to see a split multi-colo architecture with general-purpose computing tethered to a machine learning infrastructure,” he said.

Here’s a closer look at some of the strategies data center users and service providers are using to help AI workloads co-exist with traditional infrastructure.

The Challenge of Mixed Density Environments

As we’ve noted often in recent years, the imminent arrival of large-scale high-density cooling – including liquid cooling solutions – is one of the longest-running predictions, and always seems to hover near the horizon. But the rise of high-density workloads for AI is nudging this trend ahead in a meaningful way.

The 2020 State of the Data Center report found that the average rack density increased by nearly an entire kilowatt (kW) from the previous year, jumping to 8.2 kW per rack, up from 7.3 kW in the 2019 report and 7.2 kW in 2018. About 68 percent of respondents reported that rack density has increased over the past three years, with 26 percent saying the increase was “significant.” 451 Research has reported similar findings.
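The year-over-year arithmetic behind those survey figures can be checked with a short sketch. The three averages are the report figures cited above; the function and variable names are illustrative only.

```python
# Average rack density (kW per rack) as cited from the State of the Data Center reports.
avg_density_kw = {2018: 7.2, 2019: 7.3, 2020: 8.2}

def yearly_changes(series):
    """Return {year: (change in kW, percent change)} relative to the prior year."""
    years = sorted(series)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        delta = series[curr] - series[prev]
        changes[curr] = (round(delta, 2), round(100 * delta / series[prev], 1))
    return changes

changes = yearly_changes(avg_density_kw)
for year, (delta_kw, pct) in changes.items():
    print(f"{year}: {delta_kw:+.1f} kW ({pct:+.1f}%)")
```

On these figures, the 2020 increase of 0.9 kW is about nine times the 0.1 kW increase recorded the year before.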

A major factor in rising densities is the growth of data-crunching for artificial intelligence (AI). Powerful new hardware for AI workloads is packing more computing power into each piece of equipment, boosting the power density – the amount of electricity used by servers and storage in a rack or cabinet – and the accompanying heat. The challenge is that high-density racks can be disruptive to airflow and cooling management in a data center environment.

“A problem arises when high density 15 to 30-kW server racks are mixed in with mostly low and medium density racks in a data hall,” explained Kevin Facinelli, Group President for Nortek Data Center Cooling, in our recent Executive Roundtable. “Should the data center hall’s ambient temperature be lowered to accommodate the higher heat generation at a significant operational cost penalty? Or would direct liquid cooling, such as coolant distribution units (CDU) or rear door heat exchangers, efficiently accommodate the high-density racks, while leaving the ambient temperature higher for the low and medium density equipment?”

Historically, extreme workloads have often been placed in specialized facilities optimized for high density, such as Colovore or San Diego’s ScaleMatrix. Some opt for immersion cooling systems, which have gained traction in HPC and bitcoin. A growing number of providers are using containment systems to offer more flexible management of density. One example is Aligned, whose design allows customers to add power capacity by scaling in place without having to reconfigure existing infrastructure. Other providers are working to standardize rack designs to support higher-density AI hardware, both through in-house designs and by working closely with hardware specialists like NVIDIA.

As GPUs Get Denser, Will Split Architectures Emerge?

As the leading player in technical computing, NVIDIA has played a pivotal role in accelerating the adoption of AI, providing powerful data-crunching hardware that turns AI’s potential into practical reality.

Rows of cabinets in a data center in Santa Clara, Calif. (Photo: Rich Miller)

“GPUs have become the de facto workforce at the compute layer,” said Coughlin. “We’ve seen a lot of innovation on the CPU side with dual CPUs from Intel and AMD.”

This can mean power usage per rack unit (RU) of 500 to 600 watts for CPUs and 1 kilowatt for GPUs. “These are workloads that require a different environment,” he added. “We are reaching a point where data center environments really do matter.”
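To see how those per-RU figures add up at the cabinet level, here is a rough sketch. The 42U cabinet height and the assumption that every rack unit is populated are hypothetical, chosen only for illustration. Real cabinets reserve space for switches, power distribution and cable management.

```python
def cabinet_load_kw(watts_per_ru, populated_ru):
    """Estimate IT load in kW for a cabinet with `populated_ru` rack units filled."""
    return watts_per_ru * populated_ru / 1000

RACK_UNITS = 42  # hypothetical cabinet height, not a figure from the article

cpu_low = cabinet_load_kw(500, RACK_UNITS)
cpu_high = cabinet_load_kw(600, RACK_UNITS)
gpu = cabinet_load_kw(1000, RACK_UNITS)

print(f"CPU servers: {cpu_low:.1f} to {cpu_high:.1f} kW per cabinet")
print(f"GPU servers: {gpu:.1f} kW per cabinet")
```

Under these assumptions, a cabinet needs only 15 RU of GPU servers to reach the low end of the 15 to 30 kW high-density range that Facinelli describes.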

Colovore’s Santa Clara data center uses racks equipped with water-chilled rear-door cooling units. It’s among a number of data center operators that are turning to cooling doors to bring water to the rack to handle higher densities.

When it launched in 2014, Colovore built its data halls to support densities of 20 kilowatts a cabinet. It has since raised that target to 35 kW, and is experimenting with 50-kilowatt cabinets. That has meant a shift from a passive rear-door system to an active cooling unit that includes fans for faster airflow.

“We are seeing more interest in deploying water-cooled systems,” said Coughlin. “My sense is that there’s a growing awareness among the bigger operators that this trend is about to arrive. Not everyone needs cabinets at 35 kilowatts, but some may have architectures with varying requirements. This is all bubbling up, and will certainly reach a tipping point.”

Coughlin expects that AI adoption will influence how data center developers build future campuses. “It is possible to see a majority of space being traditionally air-cooled, but have a section of the campus that is optimized for high density.”

“The other piece that’s important is that distance is not your friend,” said Coughlin. “Literally milliseconds of performance can make a large difference, and that can translate into cabling lengths. We think about this a lot.”
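The link between distance and delay can be approximated with a simple propagation calculation. This sketch assumes light travels through optical fiber at roughly two-thirds of its speed in a vacuum (a refractive index of about 1.47). It ignores switching, serialization and protocol overhead, which often dominate in practice. The function name and sample distances are illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBER_REFRACTIVE_INDEX = 1.47  # assumed typical value for single-mode fiber

def one_way_delay_us(fiber_length_m):
    """Propagation delay in microseconds over a fiber run of the given length."""
    speed_in_fiber = SPEED_OF_LIGHT_M_PER_S / FIBER_REFRACTIVE_INDEX
    return fiber_length_m / speed_in_fiber * 1e6

# Sample runs: within a data hall, across a campus, across a metro area.
for length_m in (30, 300, 3_000, 30_000):
    print(f"{length_m:>6} m: {one_way_delay_us(length_m):8.2f} microseconds one way")
```

That works out to roughly 5 microseconds per kilometer each way, before any equipment delay is counted.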

For PathAI in Ashburn, AI Bridges Colo and Cloud

The need for proximity between compute and storage infrastructures is known as “data gravity” and is a strategic focus for Digital Realty. Last year the company created a DataHub design optimized for AI hardware and machine learning workloads. The DataHub footprint is based on typical customer deployments of NVIDIA GPU configurations, and is designed to integrate with enterprise storage solutions.

Digital Realty CTO Chris Sharp offers details on the PlatformDIGITAL vision. (Photo: Rich Miller)

One recent DataHub customer is PathAI, which uses AI hardware in pathology research to develop new medicines for the treatment of diseases like cancer. PathAI is a tenant in Digital Realty’s campus in Ashburn, Virginia, a key data hub that offers access to many cloud platforms with infrastructure nearby.

“Despite our background as a cloud-native company, we realized that we could really benefit from a hybrid cloud infrastructure,” said Don O’Neill, Vice President of Engineering at PathAI, who said complex workloads are now “running three to four times faster in a hybrid data center environment than purely in the cloud – a critical differentiator that significantly improved the productivity of our machine learning engineers.”

For PathAI, extreme density is an everyday fact of life. “The biggest challenge is the power requirements of modern GPU nodes,” said O’Neill. “Also, because the cluster’s load is variable, the power demand peaks and falls throughout the day.”

The Ashburn deployment supports around 17.5 kW of IT load for cabinets mixing GPU nodes alongside a number of CPU-only servers. Cold-aisle containment manages the cooling, with blanking panels ensuring efficient airflow to handle the heat from the GPU nodes.
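A rough sense of the airflow such a cabinet needs comes from the sensible-heat relationship: airflow equals heat load divided by the product of air density, specific heat and the temperature rise across the servers. The sketch below uses approximate sea-level air properties and a hypothetical 12°C temperature rise. Neither value comes from PathAI or Digital Realty.

```python
AIR_DENSITY_KG_M3 = 1.2          # approximate, near sea level at about 20 C
AIR_SPECIFIC_HEAT_J_KG_K = 1005  # approximate value for dry air
M3_PER_S_TO_CFM = 2118.88        # cubic meters per second to cubic feet per minute

def required_airflow_m3_s(it_load_kw, delta_t_c):
    """Airflow needed to carry away `it_load_kw` of heat at a given air temperature rise."""
    watts = it_load_kw * 1000
    return watts / (AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K * delta_t_c)

flow = required_airflow_m3_s(17.5, delta_t_c=12)  # 12 C rise is a hypothetical value
print(f"{flow:.2f} m^3/s (about {flow * M3_PER_S_TO_CFM:.0f} CFM)")
```

Halving the temperature rise doubles the airflow required, which is one reason containment and blanking panels matter: they keep hot exhaust from mixing back into the supply air.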

Most importantly, in Ashburn cloud platforms are right in the neighborhood.

“All our data is stored permanently on AWS S3 so being within a few milliseconds of it at the Ashburn DRT facility was very important to us,” said O’Neill. The Ashburn deployment uses a high-performance local storage network, featuring NVMe drives, a parallel file system and a 100Gbps InfiniBand network, that acts as a cache for data being actively used.

“This allows us to move large data sets from this cache to the GPUs very quickly and improve performance and maximize GPU utilization, an important metric for our cluster,” said O’Neill.

AI-Ready Designs Abound

NVIDIA’s rollout of more powerful chips will offer new ways to harness AI in business and research, and bring lots of power-hungry GPU hardware into data centers. The DGX-Ready Data Center partners program offers colocation services in more than 120 data center locations for customers seeking facilities to host their DGX A100 infrastructure in “validated, world-class data center facilities,” the company says.

That includes at least 15 data center operators ready to host customers using the latest version of DGX, the GPU-powered “supercomputer in a box.”

“The DGX-Ready Data Center program allows the world’s most innovative companies to think about solutions first without worrying about facility limitations,” said Donough Roche, SVP of Engineering and Client Services at STACK Infrastructure.