The Ghost in the Machine That Maximizes Uptime
Jeff Klaus, General Manager of Intel Data Center Management Solutions, explores how Intel Memory Failure Prediction (Intel MFP) uses machine learning to predict potential server memory failures, helping to improve memory performance and reliability and to prevent downtime.
China is the largest e-commerce market in the world, projected to reach $1.1 trillion in 2023, up from $572 billion in 2017, according to Statista. The country’s robust economic growth, a rapidly emerging middle class, a large population of computer-literate consumers, and the proliferation of smartphones are among the major factors driving the sale of physical goods and services via digital channels.
Headquartered in Beijing, Meituan-Dianping (Meituan) offers an online delivery and social commerce platform whose apps connect consumers with local businesses for food delivery, groceries, restaurant recommendations, hotel bookings, movie tickets, bike sharing, and health and fitness products and services. Named last year to Fast Company’s top-50 list of the world’s most innovative companies based in China, Meituan is keenly focused on developing new ways to make its delivery platform more cost-effective and efficient. The firm also leverages a wealth of data to help local businesses find new opportunities in the marketplace, such as determining where they might expand with new restaurants or retail locations. Like any thriving e-commerce company that strives to remain competitive and successful, Meituan must be able to rely on the health of its data center infrastructure.
Staying Online and Open for Business
A Ponemon Institute study found that the average cost of an outage in the e-commerce sector is $758,000. However, for an e-commerce business, downtime can result in collateral damage that goes far beyond the immediate loss of revenue. Brand reputation, productivity, and search engine optimization (SEO) ranking can also suffer in the wake of an outage. Imagine pulling down the shade and locking the front door of your local business just as a customer arrives at the threshold. In effect, that’s just what occurs when an e-commerce company experiences a downtime event and customers can’t access its app.
Memory failures are one of the top three hardware failures that occur in data centers today. Intel Memory Failure Prediction (Intel MFP) is an ideal solution for organizations operating online services platforms like Meituan, as well as cloud service providers relying heavily on server hardware reliability, availability and serviceability. Intel MFP helps to significantly reduce memory failure events by analyzing data and then predicting catastrophic events before they happen.
Recently, Meituan deployed Intel MFP in a test environment containing several thousand servers based on Intel Xeon Scalable processors to help improve the performance and reliability of its server memory, which is essential to a data analytics computing environment. Intel MFP uses machine learning to analyze server memory errors down to the dual in-line memory module (DIMM), bank, column, row, and cell levels to generate a memory health score, which can be used to predict potential failures.
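To make the idea of analyzing errors at several levels more concrete, here is a minimal Python sketch that groups corrected-error records by DIMM, bank, column, row, and cell. It is not Intel MFP's model or data format: the `CorrectedError` record, its field names, and the `count_by_level` helper are all hypothetical, and the counts are only the kind of per-location input a health-scoring model might consume.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectedError:
    """One corrected memory error. All field names are hypothetical."""
    dimm: str
    bank: int
    row: int
    column: int


def count_by_level(errors):
    """Count errors per DIMM, bank, column, row, and cell."""
    levels = ("dimm", "bank", "column", "row", "cell")
    counts = {level: Counter() for level in levels}
    for e in errors:
        counts["dimm"][(e.dimm,)] += 1
        counts["bank"][(e.dimm, e.bank)] += 1
        counts["column"][(e.dimm, e.bank, e.column)] += 1
        counts["row"][(e.dimm, e.bank, e.row)] += 1
        counts["cell"][(e.dimm, e.bank, e.row, e.column)] += 1
    return counts


# Illustrative records only; these are not measurements from any deployment.
errors = [
    CorrectedError("DIMM_A1", 2, 1040, 17),
    CorrectedError("DIMM_A1", 2, 1040, 18),
    CorrectedError("DIMM_A1", 2, 1040, 17),
    CorrectedError("DIMM_B1", 0, 5, 3),
]
counts = count_by_level(errors)
```

In this made-up sample, three of the four errors land in the same row of one DIMM, which is the sort of concentration a prediction model could weigh when scoring that module.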
Maintaining SLAs and Maximizing Uptime
Meituan monitored the health of its servers’ memory modules by integrating Intel MFP into its existing data center management solution. By analyzing data already collected by its data center management software, the company was able to generate a prediction score for each Dynamic Random Access Memory (DRAM) module, and then take appropriate action to maintain its service level agreements (SLAs) and maximize service uptime.
Intel MFP generated memory health scores that helped Meituan make memory reliability-aware decisions in workload scheduling, such as migrating critical tasks from distressed servers to other servers, with ample time to act and avoid critical application crashes. Moreover, by analyzing memory errors and predicting potential memory failures before they happen, Intel MFP helped improve Meituan’s DIMM replacement strategy.
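A reliability-aware scheduling decision of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Meituan's scheduler or an Intel MFP interface: the `plan_migrations` function, the 0-to-1 score scale (higher meaning healthier), the 0.5 threshold, and the server and task names are all hypothetical.

```python
def plan_migrations(health, critical_tasks, threshold=0.5):
    """Return {task: target_server} for critical tasks on distressed servers.

    health: {server: score in [0, 1]}, higher is healthier (assumed scale).
    critical_tasks: {task: server the task currently runs on}.
    """
    # Candidate targets: servers at or above the threshold, healthiest first.
    healthy = sorted(
        (s for s, score in health.items() if score >= threshold),
        key=lambda s: health[s],
        reverse=True,
    )
    if not healthy:
        return {}  # nowhere safe to move anything

    plan = {}
    next_target = 0
    for task, server in critical_tasks.items():
        if health[server] < threshold:
            # Spread migrated tasks round-robin across the healthy servers.
            plan[task] = healthy[next_target % len(healthy)]
            next_target += 1
    return plan


# Hypothetical scores and task placement, for illustration only.
health = {"srv-01": 0.92, "srv-02": 0.31, "srv-03": 0.78}
tasks = {"orders-db": "srv-02", "search": "srv-01", "payments": "srv-02"}
plan = plan_migrations(health, tasks)
```

Here the two critical tasks on the low-scoring `srv-02` are assigned to the two healthier servers, while `search` stays where it is. A production scheduler would also need to account for capacity and placement constraints, which this sketch ignores.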
Intel MFP can also help Meituan optimize OS page offlining. When the number of errors in a specific memory region spikes, that region is likely to fail soon. By detecting such bursts early, Intel MFP can suggest disabling the faulty memory pages so that they are not used again, reducing the risk of uncorrectable errors. Page offlining has become critical for large-scale data centers.
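The burst detection described above can be approximated with a simple sliding-window counter per page. The Python sketch below uses a fixed threshold in place of Intel MFP's machine-learning prediction, so it is a stand-in for illustration rather than the product's method. The `BurstDetector` class, its window and threshold values, and the page numbers are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flag pages whose corrected-error count within a time window hits a limit."""

    def __init__(self, window_seconds=3600.0, max_errors=5):
        self.window = window_seconds
        self.max_errors = max_errors
        self._events = defaultdict(deque)  # page number -> recent error timestamps

    def record(self, page, timestamp):
        """Record one corrected error; return True if the page looks faulty."""
        events = self._events[page]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window:
            events.popleft()
        return len(events) >= self.max_errors


# Hypothetical (page, timestamp-in-seconds) error stream, for illustration only.
detector = BurstDetector(window_seconds=60.0, max_errors=3)
stream = [(0x1A2B, 0.0), (0x1A2B, 10.0), (0x9F00, 12.0), (0x1A2B, 20.0), (0x9F00, 500.0)]
to_offline = [page for page, t in stream if detector.record(page, t)]
```

Only page `0x1A2B` is flagged, because it logs three errors within 60 seconds; the two errors on `0x9F00` are too far apart to count as a burst. Acting on the flag is a separate, OS-specific step. On Linux systems built with memory-failure support, for example, a page can be soft-offlined by writing its physical address to `/sys/devices/system/memory/soft_offline_page`.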
By integrating Intel MFP into its data center management solution, Meituan was able to analyze the memory health of the servers in its test environment. This helped Meituan predict failures before they happened and make informed decisions, such as using page offlining and migrating workloads and tasks to other servers.
The initial Intel MFP test deployment indicated that if Meituan deployed the solution across its full server network, server crashes caused by hardware failures could be reduced by up to 40 percent, delivering a better experience for hundreds of millions of its customers and local vendors.
Worldwide, e-commerce sales amounted to $3.53 trillion last year and are projected to grow to $6.54 trillion in 2022. While desktop PCs remain the most popular device for placing online orders, mobile devices, particularly smartphones, are quickly closing the gap. In fact, in China, four out of five e-commerce dollars were generated from mobile devices last year, according to a report by eMarketer.
As the consumers of Asia and the rest of the world continue on the path towards unfettered mobility, and as the IoT becomes more and more integrated with the apps that increasingly affect how we live and do business, the demands on data center infrastructure will only become greater. Hence, for organizations operating online services platforms that depend on maximum availability, solutions such as Intel MFP, which provides real-time visibility into server memory health and can predict catastrophic server memory failures before they happen, will become critical to their ongoing innovation and expansion.
Jeff Klaus is General Manager of Intel Data Center Management Solutions.