In this edition of Voices of the Industry, Chad Peters, Director of Infrastructure Solutions for Service Express, revisits the Bathtub Curve theory and explains how to track equipment reliability and performance for data-driven buying decisions.
The power of computing and data storage is only as good as product reliability. With billions of individuals accessing data daily for personal and professional use, confidence in its accessibility and availability is paramount to everyday life. But how can OEMs and suppliers ensure their products are reliable, not just at the point of sale, but for years down the line?
We often hear the Bathtub Curve is the tried-and-true method for determining product reliability, but this common tactic often overlooks the complexity of longevity and reliability. Read on as we dispel the myth of a method once thought ironclad and break down the importance of continued data collection throughout the product lifecycle.
Revisiting the Bathtub Curve
In our popular 2020 article "The Bathtub Curve and Data Center Equipment Reliability," Jake Blough, Chief Technology Officer for Service Express, dispels several myths about new and aging hardware. The Bathtub Curve theory suggests that equipment failure rates are high when a product is released, then decrease over the next two to three years. According to the theory, failures increase again toward the product’s End of Life (EOL) date.
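For readers who want to see the shape the theory describes, here is a minimal sketch. It assumes the curve can be approximated as the sum of three failure-rate terms: a decreasing Weibull hazard for early failures, a constant rate for random failures, and an increasing Weibull hazard for wear-out. The function names and every parameter value are illustrative assumptions, not Service Express data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1).

    Requires t > 0. A shape below 1 gives a falling rate, above 1 a rising one.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years):
    """Illustrative bathtub-shaped failure rate at age t_years (t_years > 0)."""
    early_failures = weibull_hazard(t_years, shape=0.5, scale=20.0)  # falls with age
    random_failures = 0.02                                           # flat floor
    wear_out = weibull_hazard(t_years, shape=5.0, scale=8.0)         # rises with age
    return early_failures + random_failures + wear_out


# Print the modeled failure rate at a few ages to show the dip and the rise.
for year in (0.25, 1, 2, 3, 5, 7, 9):
    print(f"year {year:>4}: failure rate {bathtub_hazard(year):.3f}")
```

With these made-up parameters, the rate starts high, bottoms out around years two to three, and climbs again later, which is the pattern the theory predicts. Whether real equipment follows it is the question the rest of this article examines.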
Although the Bathtub Curve (pictured below) accurately reflects the failure behavior of many products, we’ve found it does not apply equally to all equipment.
Defining critical and non-critical failures
To better understand the context of the Bathtub Curve, it’s essential to define the common types of failures seen in data center hardware: critical and non-critical.
A critical failure occurs when something like a CPU or system board fails. Critical server failures result in the loss of access to applications or data, a significant issue that impacts overall business productivity.
A non-critical failure occurs when a component like a disk drive or power supply fails. Modern data center equipment has built-in redundancies for these components, so there is no data loss.
Contrary to what the Bathtub Curve portrays, the sample data above shows failure rates don’t drastically increase as equipment ages. By combining reliability and longevity data, we can better understand how equipment performs over time.
Testing the myth
At Service Express, we’ve collected more than 15 years of equipment data from over half a million devices to better understand equipment longevity and reliability.
Our previous article studied only equipment longevity and how it stacks up against the Bathtub Curve. In the past two years, we’ve implemented real-time reliability studies that allow for previously unseen granularity. But what’s the difference between longevity and reliability?
Longevity reporting details a product’s expected failure rate over time. This can be useful for CapEx budget planning purposes. However, beyond equipment longevity, reliability plays a critical role. Reliability studies examine how equipment has performed over time. This practice is useful in identifying outliers within a product model number or family.
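To make the distinction concrete, here is a rough sketch of both views. It assumes longevity can be summarized as failures per device-year at each equipment age, and that a reliability outlier is any model whose failure rate is well above the median of its product family. The record layout, the function names, the 2x threshold and all sample numbers are hypothetical, not how Service Express runs its studies.

```python
import statistics
from collections import defaultdict


def failure_rate_by_age(records):
    """Longevity view: failures per device-year, grouped by equipment age.

    records is an iterable of (age_years, device_years, failures) tuples,
    where device_years is the exposure observed at that age (must be > 0).
    """
    exposure = defaultdict(float)
    failures = defaultdict(int)
    for age, device_years, failed in records:
        exposure[age] += device_years
        failures[age] += failed
    return {age: failures[age] / exposure[age] for age in sorted(exposure)}


def flag_outlier_models(model_rates, threshold=2.0):
    """Reliability view: models whose rate exceeds threshold x the family median."""
    family_median = statistics.median(model_rates.values())
    return sorted(
        model for model, rate in model_rates.items()
        if rate > threshold * family_median
    )


# Made-up sample data for illustration only.
records = [(1, 100.0, 3), (1, 50.0, 1), (2, 120.0, 2), (5, 80.0, 4)]
rates_by_age = failure_rate_by_age(records)

model_rates = {"model-x": 0.02, "model-y": 0.03, "model-z": 0.09}
outliers = flag_outlier_models(model_rates)

print(rates_by_age)
print(outliers)  # ['model-z']
```

The first function answers the budgeting question of how failure rates change with age. The second answers the narrower question of which models stand apart from their peers.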
Below is an example of equipment and its reliability. This graph utilizes data sets from over 500,000 pieces of equipment under agreement for over 6,000 customers and compares performance using 55,000 average annual service calls.
In this example (pictured above), we use the following definitions when describing the quadrants.
- Upper right – Customer model is performing better than the product family and perhaps better than similar models
- Upper left – Customer model is worse than the overall product family but is better than similar models for other customers
- Lower left – Model is underperforming against the product family, and the customer’s experience is worse than others with similar models
- Lower right – Customer experience with the model is worse than others with similar models but better than the product family
The data shows that Product A is in the lower right quadrant. This position signifies that, based on equipment model failures, Product A is typically a well-performing product compared to other products within its product family. Still, the customer is having a worse experience than most.
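The quadrant definitions above boil down to two comparisons, which can be sketched in a few lines. This example assumes each comparison is made on a failure rate where lower is better. The function name, the inputs and the sample numbers are hypothetical, since the metrics behind the graph are not described here.

```python
def classify_quadrant(model_rate, family_rate, customer_rate, peer_rate):
    """Map two failure-rate comparisons onto the four quadrants.

    All four arguments are failure rates (for example, service calls per
    device per year), so lower is better. Ties count as "better" here,
    which is an arbitrary choice.
    """
    # Horizontal axis: this model versus its overall product family.
    horizontal = "right" if model_rate <= family_rate else "left"
    # Vertical axis: this customer versus other customers with similar models.
    vertical = "upper" if customer_rate <= peer_rate else "lower"
    return f"{vertical} {horizontal}"


# Made-up numbers shaped like the Product A case: the model beats its
# product family, but this customer sees more failures than its peers.
product_a = classify_quadrant(
    model_rate=0.02, family_rate=0.04, customer_rate=0.05, peer_rate=0.02
)
print(product_a)  # lower right
```

Separating the two axes keeps the question "is this a good model?" apart from the question "is this customer's experience typical?", which is the point of the quadrant view.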
By using historical equipment reliability and performance data, customers can better understand which equipment meets their performance needs and which does not. This practice has helped customers make data-driven decisions in the data center to optimize current environments or make necessary upgrades to maintain productivity.
The impact of leveraging data-driven equipment insights
The Bathtub Curve is a useful tool when speaking about equipment performance in general. However, this method comes with many misconceptions that can cause suppliers to repurchase parts or switch models before it is necessary.
By combining longevity studies with real-time reliability data, we can better understand which equipment will continue to perform over time and which hardware requires our attention. When those high performers (or outliers) are identified, customers can better plan or extend hardware refreshes and develop budget and resource plans more efficiently.
Chad Peters is Director of Infrastructure Solutions for Service Express, a global data center solutions provider that helps IT teams control costs, optimize infrastructure strategies and automate support.