The Colossus AI Supercomputer: Elon Musk’s Drive Toward Data Center AI Technology Domination

Nov. 20, 2024
How xAI and NVIDIA are pushing the limits of AI data center compute, power, and cooling possibilities.

Elon Musk’s big-picture vision across tech sectors has now turned to artificial intelligence (AI) with xAI, a company created expressly for AI development. At the center of this effort is Colossus, one of the world’s most powerful supercomputers, a machine that could radically redefine the capabilities of AI.

The creation of Colossus marks a key achievement not only for Musk’s xAI, which aims to play a leading role in the technology’s adoption, but also for the AI community at large.

Origins and Vision Behind xAI

xAI was officially established in mid-2023 by Musk, CEO of Tesla and SpaceX, with the stated aim to “discover what the real world is like.”

According to its mission statement, "xAI is a company working on building artificial intelligence to accelerate human scientific discovery. We are guided by our mission to advance our collective understanding of the universe."

According to Musk, he founded the company because he began to worry about the dangers of unregulated AI. xAI has the stated goal of using AI for scientific discovery but in a manner that is not exploitative.

The xAI supercomputer is designed to drive cutting-edge AI research, from machine learning to neural networks. The plan is to use Colossus to train large language models (like OpenAI’s GPT series) and extend that framework into areas including autonomous machines, robotics, and scientific simulations.

Colossus

Colossus was launched in September 2024 in Memphis, Tennessee. The data center is located at a former Electrolux manufacturing site in a South Memphis industrial park.

The Tennessee Valley Authority has approved an arrangement to supply more than 100 megawatts of power to the site.

The Colossus system started with 100,000 Nvidia H100 GPUs, making it one of the world’s largest AI training platforms.
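For a rough sense of scale, consider a back-of-the-envelope power estimate. The sketch below uses typical published component figures and assumed overheads, not xAI-confirmed numbers:

```python
# Rough facility power estimate for 100,000 H100-class GPUs.
# Per-unit figures below are assumptions based on typical published specs.

GPUS = 100_000
H100_TDP_W = 700          # Nvidia H100 SXM TDP (published spec)
OVERHEAD_PER_GPU_W = 300  # assumed share of CPU, NIC, memory, storage power
PUE = 1.2                 # assumed power usage effectiveness with liquid cooling

it_load_mw = GPUS * (H100_TDP_W + OVERHEAD_PER_GPU_W) / 1e6
facility_mw = it_load_mw * PUE
print(f"IT load: ~{it_load_mw:.0f} MW, total facility draw: ~{facility_mw:.0f} MW")
```

Even with conservative assumptions, a 100,000-GPU cluster lands above the 100-megawatt mark, which puts the TVA arrangement in context.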

The deployment of these GPUs in just 19 days underscored xAI’s focus on scaling its AI infrastructure quickly.

Considering that configuring such extensive infrastructure usually takes months, or even years, the deployment drew significant attention from the media and the data center/AI industry.

This initial setup of 100,000 GPUs delivered a high level of processing power, making xAI capable of tackling highly complex AI models at cutting-edge speeds.

This speed and efficiency are essential given the ever-increasing complexity and size of contemporary AI models, which must be fed massive datasets and demand enormous computational power.
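To put numbers on that, a common approximation holds that training a transformer requires roughly 6 FLOPs per parameter per token. The following sketch uses purely illustrative model and utilization figures (not xAI’s) to show why even 100,000 GPUs translate into weeks of training for a frontier-scale model:

```python
# Back-of-the-envelope LLM training-time estimate on a 100,000-GPU cluster.
# Model size, token count, and utilization are illustrative assumptions.

PARAMS = 1e12              # hypothetical 1-trillion-parameter model
TOKENS = 10e12             # hypothetical 10 trillion training tokens
H100_PEAK_FLOPS = 989e12   # H100 SXM dense BF16, ~989 TFLOPS (published spec)
MFU = 0.35                 # assumed model FLOPs utilization (30-50% is typical)
GPUS = 100_000

total_flops = 6 * PARAMS * TOKENS               # ~6*N*D approximation
effective_flops = GPUS * H100_PEAK_FLOPS * MFU  # sustained cluster throughput
days = total_flops / effective_flops / 86_400
print(f"Training compute: {total_flops:.1e} FLOPs, est. {days:.0f} days")
```

Under these assumptions the run still takes on the order of three weeks; on a smaller cluster it would stretch into months.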

Very much an “if you build it, they will come” model, the LLM designs focused on making use of the processing power available.

Expansion Plans and Upgrades

In November 2024, xAI announced it would double the capacity of Colossus through a multibillion-dollar deal.

The firm plans to raise $6 billion in the coming years, with the bulk of it coming from Middle Eastern sovereign wealth funds.

It will cover the cost of adding 100,000 more GPUs to the existing set, bringing it to 200,000.

The planned upgrade would add Nvidia’s newer H200 GPUs, which are even more powerful than the H100 GPUs originally shipped. (The H200 is a Hopper-generation part, not to be confused with Nvidia’s upcoming Blackwell GPUs discussed below.)

NVIDIA Hits a Snag

The H200 GPUs provide considerable improvements in performance and efficiency, and will allow xAI to train AI models faster and more accurately.

These GPUs are optimized for performing deep learning and neural network training, thus making them a perfect fit for xAI’s larger AI projects.

According to Nvidia, the Blackwell GPUs can be up to 20 times faster than their previous generation GPUs, depending on the workload.

However, the delivery of the Blackwell GPU to customers has hit a snag.

Delivery of the next-generation chip to customers had already been delayed by a quarter after Nvidia discovered and fixed some design flaws.

A new delay has now cropped up, with reports that the 72-GPU configuration was overheating in Nvidia’s custom-designed server racks.

Yahoo Finance reported that the announcement of the problem caused an almost 3% drop in Nvidia’s stock price, even though a potential delay in the 2025 delivery of the GB200 had not been confirmed, nor was Nvidia willing to comment on whether the final design for the server racks had been finalized.

This larger Colossus infrastructure will make it much easier for xAI to build and test its AI models (particularly the Grok LLMs).

The Grok models are meant to challenge, and possibly even exceed, currently dominant AI systems such as OpenAI’s GPT-4 and Google’s Gemini (the successor to Bard).

Designed for AI

Colossus is different from other supercomputers not simply for its underlying computing power but also for its tailor-made AI infrastructure.

The system is built to manage the special needs of AI training — handling massive amounts of data and running highly advanced algorithms that must be parallelized. 

As widely reported, both Dell Technologies and Supermicro partnered with xAI to build the supercomputer.

The combination of Nvidia’s H100 and H200 GPUs will put Colossus at a distinct advantage when it comes to speed and efficiency. These GPUs also feature dedicated tensor cores that help accelerate deep learning algorithms.

Additionally, these GPUs offer enough memory bandwidth to efficiently handle the large datasets needed to train the latest AI models.
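As one illustration of how those tensor cores get used in practice, here is a minimal mixed-precision sketch. PyTorch is shown as one common framework example; this is not a statement about xAI’s actual software stack:

```python
# Minimal sketch: engaging tensor cores via mixed precision in PyTorch.
# Requires a CUDA-capable GPU; shown for illustration only.
import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(64, 4096, device=device)

# Under autocast, matmuls run in BF16 on the GPU's tensor cores while
# accumulating in FP32 -- which is where most of the advertised speedup lives.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```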

The primary building block of Colossus is the Supermicro 4U Universal GPU Liquid Cooled system.

Each 4U server is equipped with eight NVIDIA H100 Tensor Core GPUs, providing substantial computational power for AI training tasks.

The servers are organized into racks, with each rack containing eight 4U servers, totaling 64 GPUs per rack.

Between each 4U server is a 1U manifold for the liquid cooling, and the base of each rack contains a 4U CDU (coolant distribution unit) pumping system that provides redundant cooling and rack management.

Ethernet Networking

The servers are interconnected using NVIDIA's Spectrum-X Ethernet networking platform, enabling high-bandwidth, low-latency communication essential for AI training.

Each server is equipped with multiple 400GbE connections, running on 800GbE-capable cabling, rather than the InfiniBand option Nvidia also supports for large-scale deployments.

In the current architecture, each GPU in a cluster gets a dedicated 400GbE network interface card, with an additional 400GbE NIC dedicated to the server itself, for a potential total bandwidth of 3.6 Tbps per server.

There are 512 GPUs per array (8 racks of 64 GPUs) and almost 200 total arrays.
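Putting the reported figures together, a simple sketch (using only the numbers cited above) recovers the cluster math:

```python
# Cluster math from the reported Colossus topology figures.

GPUS_PER_SERVER = 8      # Supermicro 4U Universal GPU system
SERVERS_PER_RACK = 8
RACKS_PER_ARRAY = 8
TOTAL_GPUS = 100_000

GPU_NIC_GBPS = 400       # dedicated 400GbE NIC per GPU
SERVER_NIC_GBPS = 400    # one additional 400GbE NIC per server

gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK    # 64
gpus_per_array = gpus_per_rack * RACKS_PER_ARRAY      # 512
arrays = TOTAL_GPUS / gpus_per_array                  # ~195
server_bw_tbps = (GPUS_PER_SERVER * GPU_NIC_GBPS + SERVER_NIC_GBPS) / 1000

print(f"{gpus_per_rack} GPUs/rack, {gpus_per_array} GPUs/array, "
      f"~{arrays:.0f} arrays, {server_bw_tbps:.1f} Tbps/server")
```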

In October, NVIDIA head Jensen Huang announced that the entire initial 100,000-GPU supercomputer was set up in only 19 days, comparing that to what he described as the normal four-year build process an average data center would require.

Roadmap

So what is the company doing with all this power?

The Grok family of large language models is xAI’s primary focus. Such models interpret and create human-like text, just like OpenAI’s GPT series.

The Grok models should be more effective and capable than current language models because of Colossus’s computational capability.

Beyond language models, xAI also intends to pursue other AI applications such as autonomous vehicles, robotics, and scientific simulation. With Colossus, xAI intends to challenge the capabilities of AI in these fields.

The company, for example, is exploring using AI in science to find new materials, save energy, and even help to find new drugs.

(And if you think all this power is being used to make Tesla self-driving cars a reality, there is an entirely different AI supercomputer dedicated to that task: the Cortex AI supercluster, with 50,000 GPUs, located at Tesla’s Giga Texas plant.)

Late last month, the website ServeTheHome.com released its tour of xAI Colossus, pointing out that while the system was deployed in 19 days, the entire infrastructure took only 122 days to build.

AI: All About Cooling

Colossus also features what has been described as “a cutting-edge cooling system” that keeps the GPUs operating at stable, optimal temperatures for performance.

This is especially significant given how much heat is produced by such a large number of fast GPUs.

With this type of rack density, optimal cooling is absolutely critical, which makes the potential delay of the Blackwell server infrastructure due to overheating somewhat easier to understand.
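A quick heat-load estimate makes the point. The GPU TDP is Nvidia’s published figure; the per-server overhead is an assumption:

```python
# Rough per-rack heat load for the Colossus rack design described above.

H100_TDP_W = 700           # Nvidia H100 SXM TDP (published spec)
SERVER_OVERHEAD_W = 2500   # assumed CPUs, NICs, memory, fans per 4U server
GPUS_PER_SERVER = 8
SERVERS_PER_RACK = 8

rack_kw = SERVERS_PER_RACK * (GPUS_PER_SERVER * H100_TDP_W
                              + SERVER_OVERHEAD_W) / 1000
print(f"Estimated rack load: ~{rack_kw:.0f} kW")  # ~65 kW

# Air-cooled racks typically top out around 10-20 kW, so direct liquid
# cooling (the manifold/CDU design above) is effectively mandatory here.
```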

With customers like Colossus waiting to immediately roll out the next generation design in huge numbers, the cooling has to work properly from the start.

As we have previously reported, a number of vendors are working with Nvidia to develop cooling explicitly for the Nvidia GPU servers.

While we have covered a number of the vendors in the data center liquid cooling arena, another rapidly advancing player in the space, Boyd, announced last week that they had a product that would “enhance AI deployment ease and speed with Nvidia’s GB200 NVL72.” This is the server hardware designed for the next-generation Blackwell GPU in a 72-GPU cluster.

Doug Britt, CEO of Boyd, pointed out that their liquid-cooling technology was designed for AI, addressing the issues of how to cool these demanding applications and hardware, while easing deployment and making it faster to get up and running. Britt added:

We’re seeing next-generation large language models already exceeding 1 trillion parameters, requiring advanced computational power, like that offered by the NVIDIA GB200 NVL72 platform, which can be enhanced with next-level cooling. AI system architects rely on Boyd cooling to effectively scale compute density without expanding data center and rack footprint in the most energy-efficient way.

Bottom Line: Colossus Is a Line in the Sand for AI Competitors, Though Not Without Challenges

The competition to create the most effective AI systems has intensified over the past few years, with Google, Microsoft, and OpenAI investing heavily in supercomputers and AI research.

The investment in Colossus gives xAI a potential competitive edge, enabling it to train its AI models at scale and possibly reach breakthroughs sooner than its rivals.

Training models at scale not only shortens the time it takes to build new AI technologies, but also lets xAI dive into new fields of AI research that would otherwise be impossible given computational constraints.

By raising the capital to scale Colossus, xAI is positioning itself for the future. The additional 100,000 GPUs will double the physical capacity of the system, allowing xAI to take on even larger challenges.

Meanwhile, Nvidia’s performance claims comparing the GB200 GPUs to the existing H100 parts imply more than a simple mathematical increase in capability. This could have profound effects on the AI community, with xAI’s developments offering the opportunity to redefine what can be done with AI technology.

Colossus isn’t a project without its setbacks. The cost of cooling and powering a system with 200,000 GPUs is significant, especially at a time when sustainability is a top concern.

Additionally, Musk has stated that he expects the funding for the Colossus ramp-up to depend on sovereign wealth funds, particularly from the Middle East.

This plan has drawn criticism in some quarters, with critics arguing that foreign ownership stakes in new AI technology could have geopolitical consequences, particularly if the technology is put to practical uses beyond its research role.

DCF will continue following data center developments with xAI and the Colossus supercomputer, so watch this space.



About the Author

David Chernicoff

David Chernicoff is an experienced technologist and editorial content creator with a talent for seeing the connections between technology and business, getting the most from both, and explaining the needs of business to IT and of IT to business.
