NVIDIA Releases AI Reference Architectures For Enterprise-Class Hardware

Nov. 7, 2024
With reference designs derived from the existing Nvidia Cloud Partner reference architecture, Nvidia right-sizes for enterprise customers.

Following the release, around the OCP Summit, of a number of AI reference architectures from vendors who worked with Nvidia to develop the infrastructure hardware needed to deploy Nvidia AI hardware at scale, Nvidia has now announced its own take on a reference architecture for enterprise-scale customers, built entirely on Nvidia servers and networking technologies.

Optimizing from Design to Deployment

The newly released reference architecture (RA) offering targets enterprise-class hardware in deployments ranging from 32 to 1,024 GPUs, and is derived from the existing Nvidia Cloud Partner (NCP) design, which focuses on deployments of 128 to more than 16,000 nodes.

While the NCP architecture is designed as a guide for customers working with Nvidia's professional services, the new enterprise-class RAs are being used by Nvidia partner hardware vendors, such as HPE, Dell, and Supermicro, to develop Nvidia-certified solutions using NVIDIA Hopper GPUs, Blackwell GPUs, Grace CPUs, Spectrum-X networking, and BlueField DPU architectures.

Each of these technologies has an existing node-level certification program that allows partner vendors to deliver systems optimized for the services their customers are looking to develop. But using certified components alone is not sufficient for a vendor to release a certified solution.

The certified server configuration is subject to a testing program that includes thermal analysis, mechanical stress tests, power consumption evaluations, and signal integrity assessments. It must also pass a set of operational testing requirements covering management capabilities, security, and networking performance across a range of workloads, applications, and use cases.

Once a certified configuration is selected, Nvidia's RA guides also provide guidance on building clusters using those nodes and following common design patterns.

As announced, each of the enterprise reference architectures has certain factors in common, including recommendations for:

  • Accelerated infrastructure using optimized Nvidia-certified server configurations.
  • AI-optimized networking based on Spectrum-X Ethernet and BlueField-3 DPUs.
  • Nvidia AI Enterprise software platform for production AI deployments.

Scale-out vs. Scale-up

Recognizing the varied nature of customer demand, the Nvidia RAs support both scale-up and scale-out deployments, based on customer needs and long-term planning.

In the simplest terms, a scale-up deployment uses NVLink connections to create a high-bandwidth, multi-node GPU cluster that can be treated as a single, huge GPU, with the workloads distributed across multiple servers.

The systems can scale up to the maximum capability of the NVLink connection; for example, the GB200 NVL72 can scale up to 72 GPUs functioning as a single unit.

In the scale-out model, as described in the RA outline, the approach uses optimized PCIe connections to scale clusters, which, depending on the architectural design chosen, can scale from four to 96 cluster nodes. The workload is distributed to servers that can have different levels of optimization and performance.
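The contrast above can be sketched conceptually. This is a hypothetical illustration, not Nvidia tooling or an actual scheduler: in scale-up, an NVLink domain presents its GPUs as one aggregated logical device; in scale-out, the workload is explicitly partitioned into per-node shards.

```python
# Conceptual sketch (hypothetical, not an Nvidia API): scale-up aggregates
# GPUs in one NVLink domain into a single logical unit, while scale-out
# explicitly partitions work across separate cluster nodes.

def scale_up_capacity(gpus_per_domain: int, gpu_tflops: float) -> float:
    """Scale-up: the domain behaves like one large GPU, so compute
    capacity is simply aggregated across all GPUs in the domain."""
    return gpus_per_domain * gpu_tflops

def scale_out_shards(work_items: list, num_nodes: int) -> list:
    """Scale-out: split the workload into per-node shards, round-robin."""
    shards = [[] for _ in range(num_nodes)]
    for i, item in enumerate(work_items):
        shards[i % num_nodes].append(item)
    return shards

if __name__ == "__main__":
    # e.g. a 72-GPU NVLink domain (GB200 NVL72-style) acts as a single unit
    print(scale_up_capacity(72, 1.0))           # aggregate capacity: 72.0
    print(scale_out_shards(list(range(8)), 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

The per-GPU throughput figure and the round-robin sharding policy are placeholders chosen for illustration; real deployments distribute work based on topology, interconnect bandwidth, and workload characteristics.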

Benefits of Choosing the RA Designs and Certified Vendors

Overall, the selection of systems built using these reference architecture designs means getting the benefits of the years of engineering expertise and experience that Nvidia has acquired in building large scale systems. Nvidia breaks the benefits down into those that apply to IT and business operations.

For IT, these designs deliver higher performance, easier support, reduced complexity, improved TCO, and easier scaling and management of deployments.

On the business side, Nvidia highlights reduced costs, better efficiency, accelerated time to market, and significant risk mitigation.

The guidance provided is designed to accelerate the transition of data centers from traditional working models to AI factories. 

Vendors making use of the RAs get the benefit of Nvidia's experience with their hardware and software solutions, while customers deploying the certified solutions are assured of minimal roadblocks on the path to a rapid and effective deployment.


About the Author

David Chernicoff

David Chernicoff is an experienced technologist and editorial content creator with the ability to see the connections between technology and business, figure out how to get the most from both, and explain the needs of business to IT and of IT to business.

