Vertiv, NVIDIA Collaboration Models Importance of Reference Designs to AI's Future in the Data Center
The exponential growth of artificial intelligence (AI) applications has transformed the landscape of data center design. From deep learning models to large-scale machine learning and real-time inference, AI and HPC workloads demand increasingly complex, high-performance infrastructure, and that demand is directly shaping how facilities are designed and built.
To meet these demands and shorten the timeframe for deployments, companies are turning to reference architectures for rack and cooling systems, which offer a standardized and scalable approach to building AI-centric data centers.
These architectures streamline the deployment of advanced hardware and ensure that facilities are equipped to handle the immense computational, thermal and power demands of AI infrastructure.
Open Reference Architectures (At Least for NVIDIA) to Plan and Build AI Factories
As the provider of the bulk of energy-intensive GPUs in the data center, it’s unsurprising that NVIDIA has opened up some of its architectural designs for supporting AI computing.
At the 2024 OCP Global Summit in October, the company announced that it had both contributed foundational elements of its NVIDIA Blackwell accelerated computing platform design to the Open Compute Project (OCP) and increased support for OCP standards within its NVIDIA Spectrum-X platform.
To broaden industry support for the NVIDIA GB200 NVL72 system, the company announced that it would make available design information covering the system's rack architecture, compute and switch tray mechanicals, liquid-cooling and thermal environment specifications, and NVLink cable cartridge volumetrics.
This information is critical for vendors looking to support the NVIDIA GPU hardware platform and the demands that its connectivity and compute aspects place on the supporting infrastructure.
OCP Tie-In
Based on the NVIDIA MGX Modular Architecture, the latest NVIDIA GB200 NVL72 system design supports 36 Grace CPUs and 72 Blackwell GPUs in a liquid-cooled, rack-scale design.
The design also supports the OCP’s Switch Abstraction Interface (SAI) and Software for Open Networking in the Cloud (SONiC) standards to accelerate scale-out AI infrastructure that uses NVIDIA’s networking hardware.
NVIDIA says it is working with more than 40 electronics makers worldwide that can provide the key components for building AI factories.
It is also working with heavy users of its AI GPUs, which are themselves contributing to the open reference architecture standards promulgated by the OCP.
For example, Meta, which as Facebook founded the OCP, plans to contribute its Catalina AI rack architecture, designed for the GB200 NVL72, to the OCP. According to Yee Jiun Song, vice president of engineering at Meta:
As we progress to meet the increasing computational demands of large-scale artificial intelligence, NVIDIA’s latest contributions in rack design and modular architecture will help speed up the development and implementation of AI infrastructure across the industry.
But as the industry has acknowledged, these dense, high-performance systems generate enormous amounts of heat, necessitating the adoption of liquid cooling and other advanced thermal management solutions.
And of course, the power demands of AI workloads are unprecedented. AI racks running even the 108 processors in the latest NVIDIA design require significantly more power than racks running traditional workloads.
Managing this power efficiently while keeping costs under control requires an integrated approach, combining power and cooling infrastructure into a seamless solution. This is where reference architectures come in.
Bring On Vertiv’s Collaboration with NVIDIA
Also at the OCP Summit, Vertiv announced a reference architecture that shows how a single manufacturer can provide the supporting infrastructure in an almost turnkey fashion.
The reference design takes an end-to-end approach to power and cooling, optimizing each component to ensure that data centers can meet the unique requirements of AI applications.
This includes a high-efficiency hybrid liquid- and air-cooling system designed to manage the heat generated by the NVIDIA GB200 GPUs.
Supporting up to 7 MW of power per data center, the architecture also includes power management technologies such as Vertiv’s Trinergy UPS and EnergyCore lithium battery cabinets, which are intended to ensure efficient use of power and improve reliability.
And while current delays in AI deployment often come down to GPU availability, the Vertiv-NVIDIA collaboration offers preconfigured modules and factory integration, which Vertiv says allow the system to be deployed up to 50% faster than traditional on-site builds.
As Jensen Huang, founder and CEO of NVIDIA, pointed out:
New data centers are built for accelerated computing and generative AI with architectures that are significantly more complex than those for general-purpose computing. With Vertiv's world-class cooling and power technologies, NVIDIA can realize our vision to reinvent computing and build a new industry of AI factories that produce digital intelligence to benefit every company and industry.
One of the advantages that Vertiv points out with the use of this reference architecture is that the design has a power and cooling infrastructure matched to, and closely coupled with, the NVIDIA Blackwell platform.
Also, this isn’t a standalone or one-off design: the reference architecture joins the Vertiv 360AI portfolio of reference designs for retrofit and greenfield data centers and takes advantage of the lessons learned in the development of the range of data center designs.
Other advantages that Vertiv brings to the table with this design collaboration include:
- Rapid Deployment and Retrofit: Preconfigured modules and factory integration deliver turnkey AI critical infrastructure up to 50% faster than onsite builds.
- Space-Saving Power Management: Using Vertiv’s advanced power technologies, including uninterruptible power supply (UPS) systems and lithium battery cabinets, the design uses approximately 40% less space than legacy offerings.
- Energy-Efficient Cooling: Integrating liquid and low-GWP (Global Warming Potential) air cooling technologies at scale.
- Dynamic Workload Management: Integrated load averaging via lithium-ion battery and next generation UPS provides support for dynamic GPU workloads.
- Installation and Operations Services: With more than 4,000 field service engineers, Vertiv can serve as the lifecycle service partner and complex system-level expert for retrofits and new builds.
This collaboration between Vertiv and NVIDIA is a leading example of how companies are working together to meet the challenges of AI infrastructure.
In fact, with the ability to support 132 kW per rack and up to 7 MW per data center, this reference design for AI-centric data centers is likely to push businesses to re-evaluate their design goals for AI-capable data center development.
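To put those two figures in perspective, here is a rough back-of-the-envelope sketch of how many 132 kW racks a 7 MW envelope could feed. It is an illustration, not part of the Vertiv or NVIDIA design: it assumes the 7 MW figure can be treated as a simple power budget, and the `it_fraction` parameter (the share of that budget that actually reaches the racks) is a hypothetical knob introduced here, as is the function name.

```python
import math

RACK_POWER_KW = 132   # per-rack figure quoted in the article
SITE_POWER_MW = 7     # per-data-center figure quoted in the article
GPUS_PER_RACK = 72    # Blackwell GPUs in one GB200 NVL72 rack


def max_racks(site_power_mw: float, rack_power_kw: float,
              it_fraction: float = 1.0) -> int:
    """Upper bound on rack count for a given power budget.

    it_fraction is an assumed share of site power that reaches the racks;
    1.0 means every kilowatt goes to IT load, which is optimistic.
    """
    if not 0 < it_fraction <= 1:
        raise ValueError("it_fraction must be in the range (0, 1]")
    usable_kw = site_power_mw * 1000 * it_fraction
    return math.floor(usable_kw / rack_power_kw)


if __name__ == "__main__":
    racks = max_racks(SITE_POWER_MW, RACK_POWER_KW)
    print(f"Racks at full budget: {racks}")
    print(f"GPUs at full budget:  {racks * GPUS_PER_RACK}")
```

Under the optimistic assumption that the whole budget reaches the racks, the ceiling is a few dozen racks and a few thousand GPUs, so the rack count drops quickly once cooling and distribution overhead are taken out of the same budget.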
How More Vendors Supporting AI Are Offering Their Own Reference Designs
Given the demand for NVIDIA GPUs, rapid deployment and standardized infrastructure, it is no surprise that other vendors are jumping on the opportunity to demonstrate their expertise in the marketplace.
Much like Meta, Microsoft has developed its own custom GPU cabinets for its Microsoft Azure AI deployments. Microsoft, which has previously announced support for OCP standards, publicly showed a custom liquid-cooled AI computing rack at the beginning of October.
Demonstrating it at the Microsoft Ignite conference and at the OCP Summit, Microsoft claimed that its Azure deployment was the first to run a Blackwell system with NVIDIA GB200-powered servers.
The systems use custom closed-loop liquid cooling and are designed to handle the power and heat dissipation requirements of AI workloads. Microsoft’s approach emphasizes InfiniBand networking and advanced cooling techniques.
Microsoft is backing a significant effort across many aspects of next-generation data center design, including power and cooling, and in April 2024 released its State of the AI Infrastructure report.
Supermicro was early to the party. Back in March, at NVIDIA's announcement of its next-generation GPUs, it demonstrated custom turnkey, rack-scale, plug-and-play liquid-cooled AI SuperClusters supporting NVIDIA Blackwell and H100/H200 GPUs.
Much like Vertiv, Supermicro touts the advantages of its turnkey solutions, though it lacks the breadth of product that Vertiv has available.
Also at the OCP Summit, ASRock Rack announced its 6U8X-GNR2/DLC system, designed specifically for NVIDIA Blackwell GPUs and featuring direct-to-chip liquid cooling technology.
This system is aimed at organizations running large-scale AI inference models and differentiates itself by offering a compact form factor.
The Importance of the Open Compute Project to Future AI Development
In a blog post published on October 15, 2024, the OCP Foundation highlighted the expansion of its Open Systems for AI initiative, recognizing NVIDIA and Meta for their efforts in this space.
The initiative was originally announced in January, with founding members including Intel, Microsoft, Google, Meta, NVIDIA, AMD, Arm, Ampere, Samsung, Seagate, Supermicro, Dell and Broadcom.
With an announced goal...
“…to establish commonalities and develop open standardizations for AI clusters and the data center facilities that host them, advancing efficiency, sustainability and enabling the development of a multi-vendor supply chain that rapidly and impactfully advances the market adoption...”
...the announcements made prior to and at the OCP Summit 2024 show that there are significant efforts being made to live up to this target.
AI Reference Designs: Perspective
Possibly more important is the impact that these collaborative efforts, especially from vendors like Vertiv that offer the full range of data center infrastructure products, will have on all next-generation data center designs.
Not every data center operator is likely to have a significant focus on research and development of these designs, and the efforts being made by those companies that can offer an integrated environment can show the way forward.
Beyond the OCP, the efforts being made to develop efficient power, cooling, and distribution environments are of critical importance to the data center industry as a whole, now that the need to deploy more energy-efficient, environmentally friendly, and sustainable data centers has moved front and center in construction and operational planning.