Optimizing Public Cloud Infrastructure for Peak Performance

In this edition of Voices of the Industry, Jeff Klaus, General Manager of Intel Data Center Management Solutions, explores the growing market of public cloud services, as well as how to optimize public cloud infrastructure.

Jeff Klaus, General Manager of Intel Data Center Management Solutions

According to Gartner, the global public cloud services market is projected to grow approximately 21 percent this year to more than $186 billion, up from $153.5 billion in 2017. The research and advisory firm also predicts that by 2021, nearly a third of all IT spending will be on cloud-based infrastructure, middleware, application and business process services.

A growing share of enterprises sees public cloud as a top priority of their IT strategy: 38 percent in 2018, compared to 29 percent last year, according to RightScale’s most recent State of the Cloud Survey. Citigroup, FedEx, GAP, GE, Intuit, and Kaiser Permanente are among the countless global organizations across diverse industries that now rely on public cloud services in a quest to realize digital transformation.

There’s no doubt that public clouds help enterprises to shift capital expenses to operational expenses, keep infrastructure costs low for new projects, and offer economies of scale as well as virtually unlimited elasticity. Infrastructure-as-a-Service (IaaS) is one of the biggest growth areas within the public cloud, and Software-as-a-Service (SaaS) applications such as Customer Relationship Management (CRM) and Enterprise Resource Planning (ERP) continue to gain traction across the enterprise landscape, primarily due to their flexibility.

Although industry-leading companies now offer cloud computing services as a core part of their businesses, workloads in a cloud environment can still be very complex. Moreover, with IaaS or Platform-as-a-Service (PaaS) clouds, Cloud Service Providers (CSPs) might not have access to application performance data. Additionally, many important cloud workloads are latency-critical, and hence require strict levels of Quality-of-Service (QoS) to meet user expectations. Delays of just a few hundred milliseconds can greatly impact user experience. A web search, for example, must be completed within a fraction of a second, otherwise users are likely to give up and abandon their search. A delay at an e-Commerce site will often mean that a customer takes his or her business elsewhere.

Does Your Cloud Run on Schedule?

One of the important issues facing cloud computing is scheduling, which affects system performance. Good scheduling improves resource utilization and, as a result, reduces task completion time. Latency-critical workloads need to be scheduled as soon as possible to avoid any queuing delays. On the other hand, best-effort jobs such as Hadoop MapReduce should be allowed to occupy idle resources whenever they are available, which improves resource utilization. The challenge is how to minimize the queuing delays of short jobs while maximizing cluster utilization.
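The trade-off described above can be illustrated with a minimal Python sketch. It is not any CSP's actual scheduler: the `Job` type, the `schedule` function, the job names and the core counts are all hypothetical, and a single pool of CPU cores stands in for a whole cluster. Latency-critical jobs get first claim on the cores, and best-effort jobs only backfill whatever is left idle.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str               # hypothetical job label
    cores: int              # CPU cores the job asks for
    latency_critical: bool  # True for latency-critical, False for best-effort


def schedule(jobs, total_cores):
    """Run one scheduling round and return (placed, waiting, free_cores)."""
    free = total_cores
    placed, waiting = [], []

    # Pass 1: latency-critical jobs are placed first so they never queue
    # behind best-effort work.
    for job in jobs:
        if job.latency_critical:
            if job.cores <= free:
                placed.append(job.name)
                free -= job.cores
            else:
                waiting.append(job.name)

    # Pass 2: best-effort jobs backfill the cores that are still idle.
    for job in jobs:
        if not job.latency_critical:
            if job.cores <= free:
                placed.append(job.name)
                free -= job.cores
            else:
                waiting.append(job.name)

    return placed, waiting, free


# Hypothetical workload mix on a 16-core host.
jobs = [
    Job("web-search", 8, True),
    Job("mapreduce-a", 6, False),
    Job("cassandra", 4, True),
    Job("mapreduce-b", 4, False),
]
placed, waiting, free = schedule(jobs, total_cores=16)
print(placed, waiting, free)
```

In this example both latency-critical jobs are placed immediately, one best-effort job fills the remaining four cores, and the other waits for the next round. Note that this placement is static: nothing is adjusted once the jobs are running.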


Some scheduling approaches only assign resources statically, but do not regulate resources after the initial assignment. These methods fail to provide an effective solution for managing the dynamic nature of cloud workloads. Most CSPs need to understand their workload properties to effectively predict the QoS and optimize the specified metrics.

Tier 1 CSPs have good scheduling, but the key algorithm is credential-based and may be workload specific. Most CSPs lack effective scheduling of their workloads, leading to inefficient utilization of resources and an increase in task completion time.

The good news for CSPs is that there is a resource management solution that provides an effective scheduling algorithm and can allocate the appropriate hardware resources, such as CPU cores and cache size, to different workloads. Latency-critical workloads can then meet their Service Level Agreement (SLA) requirements while the hardware as a whole stays fully utilized.

A resource management solution can provide important application utilization data, such as per-application cache occupancy and per-application memory bandwidth data. By incorporating the counters provided by the platform’s Performance Monitoring Unit (PMU), such as cycles per instruction and last-level cache misses per instruction, a resource management solution makes it possible to access valuable information to enhance application performance monitoring. Again, the objective is to help CSPs manage their resources more efficiently.
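As a rough illustration of the two derived metrics named above, the following Python sketch computes cycles per instruction and cache misses per instruction from raw counter values. The function names and the sample numbers are hypothetical; how the counters are actually read from the PMU is platform-specific and outside the scope of this sketch.

```python
def cycles_per_instruction(cycles, instructions):
    """CPI: average number of CPU cycles spent per retired instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def cache_misses_per_instruction(cache_misses, instructions):
    """Cache misses per retired instruction over the same sampling window."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cache_misses / instructions


# Hypothetical counter values for one application over one sampling window.
sample = {"cycles": 2_400_000, "instructions": 1_600_000, "cache_misses": 8_000}

cpi = cycles_per_instruction(sample["cycles"], sample["instructions"])
mpi = cache_misses_per_instruction(sample["cache_misses"], sample["instructions"])
print(f"CPI = {cpi:.2f}, misses per instruction = {mpi:.4f}")
```

A CPI or miss rate that rises for the same application under the same load is one hint that a colocated workload is competing for shared resources such as cache or memory bandwidth.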

Based on historical application utilization data and metrics, a resource management solution uses an advanced machine learning control loop to colocate best-effort workloads without impacting the latency-sensitive application. In addition, by leveraging a resource director technology solution and other performance metrics, a resource management solution can use the machine learning algorithm to detect the performance impact of different kinds of contention, such as cache contention, TDP contention and memory bandwidth contention. This information is also invaluable to CSPs’ existing schedulers and can improve their efficiency.
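The idea of a control loop that protects the latency-sensitive application can be sketched in a few lines of Python. This is deliberately not the machine learning algorithm described above, whose details are not given here: it is a plain threshold-based feedback rule standing in for it. The function name, the 100 ms SLA, the 0.8 headroom factor and the latency samples are all assumptions made for illustration.

```python
def adjust_best_effort_cores(be_cores, observed_latency_ms, sla_ms,
                             max_be_cores, headroom=0.8):
    """Return the best-effort core allocation for the next control interval.

    Hypothetical rule: halve the allocation when the latency-critical
    workload violates its SLA, and grow it by one core when latency is
    comfortably (headroom * sla_ms) below the SLA.
    """
    if observed_latency_ms > sla_ms:
        return be_cores // 2          # back off quickly on an SLA violation
    if observed_latency_ms < headroom * sla_ms and be_cores < max_be_cores:
        return be_cores + 1           # reclaim idle capacity cautiously
    return be_cores                   # close to the SLA: hold steady


# Hypothetical latency samples (ms) of the latency-critical workload.
latencies_ms = [40, 45, 120, 60, 85, 50]

be_cores = 4
history = []
for latency in latencies_ms:
    be_cores = adjust_best_effort_cores(be_cores, latency,
                                        sla_ms=100, max_be_cores=8)
    history.append(be_cores)

print(history)
```

Unlike a static assignment, the allocation here is revisited at every interval: the best-effort share shrinks as soon as the latency-critical workload misses its target and grows back once there is slack.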

All good in theory, but successful use cases are where the rubber meets the road, or perhaps where the application meets the cloud. In a recent experiment, IT staff ran latency-critical workloads such as Apache Cassandra and deployed the machine learning control loop algorithm. Cassandra is an open-source, distributed NoSQL database system that runs on commodity hardware or cloud infrastructure for mission-critical data. Constant Contact, eBay, GoDaddy, Hulu, Instagram, Netflix, and The Weather Channel all use Cassandra to handle large, active data sets.

If a best-effort workload is added statically, the latency-critical workload is heavily impacted and can’t meet its SLA. With the machine learning control loop algorithm managing the best-effort workload, Cassandra performed very well, and the best-effort job still achieved output similar to what it achieved with the static scheduler.

In the near future, public cloud services will increasingly be enlisted to support everything from Artificial Intelligence (AI) and machine learning, to more Big Data analytics and Internet of Things applications, to Containers-as-a-Service and perhaps even blockchain technology. These trends will produce massive amounts of data across public cloud infrastructures that will need to run on time and efficiently in order to meet enterprise and end-user expectations.

Jeff Klaus is General Manager of Intel Data Center Management Solutions.
