Hotter Hardware: Rack Densities Test Data Center Cooling Strategies
It’s unlikely to come as a surprise to anyone that rack densities in data centers have been increasing. The increase in computing power for the average server, along with the concurrent increase in power demand for CPUs and GPUs, has been easy to track. But two recent changes to the IT workloads commonly found in data centers have driven the demand for power and cooling exponentially higher.
The first has been a long time developing, giving data centers the opportunity to grow their power and cooling resources to accommodate it: the effective commoditization of high performance computing (HPC) resources. While HPC was once almost strictly the domain of supercomputers and scientific computing tasks, the relatively inexpensive ability to build clusters of high performance server CPUs and GPUs has lowered the barrier to entry. Now businesses can invest in the necessary hardware or buy cloud HPC services to address far more common use cases: effectively any use case that requires processing large volumes of data, executing complex simulations, or simply solving problems more quickly and efficiently.
The steady growth in demand for HPC services has led to, among other things, dedicated HPC data centers built at the high end of the power density spectrum. These facilities have led the way on rack deployments capable of supporting 50 kW per rack, more than an order of magnitude more power than was found in a typical data center less than a decade ago.
While data centers have had time to adapt to the growth in demand for HPC services, the second change has ramped up incredibly quickly. Seemingly out of nowhere, generative AI is being heralded as the solution to every computing task and problem. The processes and workflows that make up AI services, such as model training and inference, are compute intensive and power hungry, and their power demands and accompanying cooling needs are increasing exceptionally quickly.
As an example, Nvidia, in announcing its latest AI GPU system, the DGX H100, projected a maximum power consumption of 10.2 kW, 160% of the previous-generation DGX A100. The H100 server boards are also available in four- and eight-GPU configurations so that vendors can build their own clusters. This means that high-density racks, once naively considered to be anything over 16 kW, can easily be configured in the 40 kW and up range.
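For a rough sense of that math, here is a minimal Python sketch of rack power budgeting. Only the roughly 10.2 kW per-system figure comes from Nvidia’s announcement; the rack budget and overhead allowance below are illustrative assumptions, not vendor specifications.

```python
# Back-of-the-envelope rack power budgeting.
# DGX_H100_MAX_KW is the projected maximum draw cited by Nvidia;
# RACK_BUDGET_KW and OVERHEAD_KW are assumed values for illustration.

DGX_H100_MAX_KW = 10.2   # projected max draw per DGX H100 system
RACK_BUDGET_KW = 40.0    # hypothetical high-density rack power budget
OVERHEAD_KW = 2.0        # assumed allowance for switches, PDUs, fans

def systems_per_rack(rack_budget_kw: float,
                     system_kw: float = DGX_H100_MAX_KW,
                     overhead_kw: float = OVERHEAD_KW) -> int:
    """How many systems fit in a rack's power budget at worst-case draw."""
    usable = rack_budget_kw - overhead_kw
    return max(0, int(usable // system_kw))

if __name__ == "__main__":
    n = systems_per_rack(RACK_BUDGET_KW)
    print(f"{n} DGX H100 systems fit in a {RACK_BUDGET_KW:.0f} kW rack "
          f"({n * DGX_H100_MAX_KW:.1f} kW of IT load)")
```

Even at a 40 kW budget, only a handful of systems fit per rack, which is why extreme-density designs quickly become attractive for AI clusters.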
Don’t Sweat the Small Stuff
Delivering this kind of power is one thing, and the makers of power distribution units (PDUs) are finding appropriate solutions, provided a data center has the necessary power available (power availability being an issue unto itself). Cooling this kind of density is a completely different story, and the leading options are liquid cooling solutions.
Fortunately, there are a number of solutions already available that allow users to cool anything from a single component to a system, rack, row, hall, or an entire data center, depending upon the needs of the customer. Many of these solutions are quite mature, though they have only recently seen broader acceptance.
But as this week’s announcement of the acquisition of liquid cooling leader CoolIT Systems shows, there are high expectations that the market for cooling high-density rack systems will continue to grow. Increasing rack density is the only practical solution to data center real estate needs, especially for AI services. The high energy demands of AI GPU/CPU solutions mean that it is more practical and cost effective to deploy a single rack capable of 50 kW or more than multiple 15 kW – 20 kW racks providing the same compute. The power requirements remain the same, but the cooling costs could be lower than installing multiple systems to cool the same amount of power spread out over a larger area, and the supporting hardware (racks, PDUs, CDUs, etc.) potentially drops in cost as well, resulting in overall efficiency gains for higher-density deployments.
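A quick sketch makes the consolidation argument concrete. The per-rack fixed cost below is a placeholder assumption purely to show the shape of the tradeoff; the rack densities mirror the 15 kW – 20 kW versus 50 kW comparison above.

```python
# Rough comparison of rack count and supporting-hardware cost for the same
# IT load at different rack densities. Cost figures are assumptions.

import math

def racks_needed(total_it_kw: float, per_rack_kw: float) -> int:
    """Racks required to host a given IT load at a given rack density."""
    return math.ceil(total_it_kw / per_rack_kw)

TOTAL_IT_KW = 300             # hypothetical AI cluster load
PER_RACK_FIXED_COST = 12_000  # assumed rack + PDU + cabling cost, USD

for density_kw in (15, 20, 50):
    n = racks_needed(TOTAL_IT_KW, density_kw)
    print(f"{density_kw:>2} kW racks: {n:>2} racks, "
          f"~${n * PER_RACK_FIXED_COST:,} in supporting hardware")
```

The total power drawn is the same in every case; what changes is the number of racks, the floor space, and the amount of supporting hardware that has to be bought, powered, and cooled.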
High-density data center specialist Colovore provides a good example of a data center built to support high-density racks and their related workloads. All of the racks in its existing facility support 35 kW, and racks can use rear-door liquid-cooled heat exchangers to provide suitable cooling. Cooling vendors generally rate rear-door heat exchangers as suitable for up to 40 kW of power, making them one of the simplest ways to cool high-density racks.
But this is just the beginning; when we spoke to Colovore last year about the new facility it is building, the company told us its plans included offering direct liquid cooling technologies that will allow it to support densities as high as 250 kW per rack.
If You Offer It, They Will Use It
Once 200 kW-plus rack densities are available in data centers, you can be sure that customers will be lining up to make use of them, especially in areas where new data center space and power are tightly constrained. With each new generation of AI- and HPC-specific hardware delivering a significant increase in capabilities along with a 30%-60% increase in power consumption (based on past trends), these extreme-density racks will become commonplace for those technologies.
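To see how quickly that compounds, here is a short sketch projecting per-rack power under the 30%-60% per-generation range noted above. The starting density and number of generations are illustrative assumptions, not forecasts.

```python
# Compounding the 30%-60% per-generation power growth mentioned above.
# START_KW and GENERATIONS are assumed values for illustration only.

START_KW = 40.0     # assumed current AI rack density
GENERATIONS = 3     # hypothetical number of hardware refreshes

for growth in (0.30, 0.60):
    kw = START_KW * (1 + growth) ** GENERATIONS
    print(f"{growth:.0%} per generation -> ~{kw:.0f} kW per rack "
          f"after {GENERATIONS} generations")
```

Even at the low end of that range, a few hardware refreshes are enough to push today’s high-density racks well beyond what air cooling can handle.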
Racks in the 15 kW-30 kW range will replace today’s commonplace 10 kW racks, if only for more economical use of space and cooling capacity. The industry is faced with the need to increase rack densities while taking advantage of any economies that can be derived from high- and extreme-density racking.
Fortunately, for most data centers this is not an all-or-nothing proposition. For example, rear-door heat exchangers (RDHX) can be retrofitted to existing server racks. An RDHX has the benefit of being a passive cooling solution: while liquid, most commonly water, is circulated from the RDHX to an external heat exchanger or cooling tower, the IT equipment itself remains untouched.
And because they can be retrofitted on a per-rack basis, the technology scales simply. It can be applied to racks that will require additional cooling as their IT workloads change or as new technologies, such as a rack of AI servers, are deployed, allowing for much higher rack densities, where necessary, than traditional data center air-to-air cooling. This also means that rack locations originally designed to max out at 10 kW – 12 kW can now effectively support 20 kW – 30 kW workloads, providing a major increase in possible rack densities and greater flexibility in workload placement within a data center.
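The retrofit decision can be framed as a simple per-rack check, sketched below. The thresholds reuse the figures already mentioned in this article, a roughly 10 kW – 12 kW air-cooled design limit and the roughly 40 kW rating vendors commonly cite for rear-door heat exchangers, and are assumptions for illustration rather than guidance for any specific facility.

```python
# Per-rack cooling triage using the thresholds discussed in the article.
# Both limits are assumed, illustrative values.

AIR_COOLED_LIMIT_KW = 12.0  # original per-location air-cooled design limit
RDHX_RATING_KW = 40.0       # typical vendor rating for a rear-door HX

def cooling_plan(rack_load_kw: float) -> str:
    """Suggest a cooling approach for a planned rack load."""
    if rack_load_kw <= AIR_COOLED_LIMIT_KW:
        return "existing air cooling is sufficient"
    if rack_load_kw <= RDHX_RATING_KW:
        return "retrofit a rear-door heat exchanger"
    return "consider direct liquid or immersion cooling"

for load in (10, 25, 55):
    print(f"{load} kW rack: {cooling_plan(load)}")
```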
And for data centers designed to be able to increase available power, as many recent facilities have done, the addition of RDHX to increase rack density means more efficient use of the power being delivered to the facility.
Flexible Cooling Tech Can Also Boost Rack Density
Once a decision is made to adopt liquid cooling technology to allow for higher density racks, there is a broad range of potential solutions, almost all of which can be tailored to meet the needs of specific applications.
While a solution such as an RDHX provides cooling to a full rack of equipment, your needs might be such that a full immersion system from a vendor like Iceotope or GR Cooling is the right choice to cool your on-premises HPC deployment, while your existing cooling remains adequate for your standard IT workloads.
Or perhaps you’ve decided to deploy a flexible liquid cooling infrastructure within your existing data center so that you can use the space more efficiently. Now you have the option, in addition to RDHX, to deploy cold plate cooling to hit the specific hot spots you’ve determined can be problem areas. By directly cooling CPUs, GPUs, memory, or entire blades, you can effectively control the heat being dissipated by specific systems in your rack environments, making cooling more efficient across the data center as a whole. Individual servers and racks can get tailored, liquid-cooled solutions in the existing environment with little to no impact on other hardware.
It's Not If, But When
Your current data centers may never see the need for extreme-density racks or even high-density levels of power demand, but the long-term advantages of higher densities, more efficient cooling, hotter operations, and more flexible solutions will manifest themselves in more cost-effective data centers that can demonstrate better ROI and lower OPEX over the projected life of the facilities.
Your new data centers will be built with these capabilities as part of the design. It only makes sense to build for the future and have as many ways as possible to deliver the most efficient, sustainable, effective operation. How much you invest in your existing facilities to improve their performance and operational effectiveness is a different story and will, in most situations, need to be justified on a case-by-case basis, but doing it right will likely show a direct impact on your business model and your bottom line.