
**Optimizing CPU Utilization for Twitter Services: Solving the Efficiency Problem**

**Understanding the Problem: Exploring CPU Bound Services at Twitter**

Twitter runs many heavily CPU-bound services, and raising their efficiency is an ongoing challenge. Most services at Twitter begin to degrade when utilization reaches roughly 50% of the container's reserved CPU. Although CPU-bound services should, in principle, sustain much higher utilization, in practice they cannot exceed that 50% mark during peak-load events. Two factors drive this limitation: load imbalance across shards, and the severe performance degradation that sets in beyond 50% CPU utilization. This article examines potential solutions, taking into account service configurations and the Linux scheduler as used at Twitter.

**The Theoretical Problem: Linux Configuration and CPU Bound Services**

Twitter predominantly runs on Linux with the CFS scheduler, using CFS bandwidth control quotas for isolation. This configuration lets different services be colocated on the same machines without one service's CPU usage destroying the performance of others, and it prevents services on mostly empty machines from consuming far more cores than they would on a loaded machine, which would lead to unpredictable performance. However, while the quota mechanism limits each container's amortized CPU usage, it does not restrict how many cores a job can use at any given instant. Instead, if a job uses more than its allotted runtime within a quota period, it briefly runs on extra cores and is then throttled (prevented from running), which compromises tail latency.
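The arithmetic behind this throttling is worth making concrete. The following Python sketch models the mechanism described above with hypothetical numbers (a 2-core reservation and a 100ms period; real values come from `cpu.cfs_quota_us` and `cpu.cfs_period_us`), assuming for simplicity that every runnable thread runs flat-out on its own core:

```python
# Toy model of CFS bandwidth control: a container gets a runtime budget of
# (reserved cores x period) core-ms per period; once the budget is spent,
# every thread in the container stalls until the next period begins.
PERIOD_MS = 100                          # hypothetical cpu.cfs_period_us
RESERVED_CORES = 2                       # hypothetical reservation
QUOTA_MS = RESERVED_CORES * PERIOD_MS    # runtime budget per period, core-ms

def time_throttled(runnable_threads: int) -> float:
    """Fraction of each period spent throttled, assuming each runnable
    thread burns a full core whenever the container may run."""
    if runnable_threads <= RESERVED_CORES:
        return 0.0
    # T threads consume T core-ms per wall-clock ms, so the budget is
    # exhausted after QUOTA_MS / T wall-clock ms of running.
    running_ms = QUOTA_MS / runnable_threads
    return (PERIOD_MS - running_ms) / PERIOD_MS

print(time_throttled(2))   # 0.0  -> within the reservation, never throttled
print(time_throttled(8))   # 0.75 -> stalled for 75ms of every 100ms period
```

The model is deliberately crude, but it captures the failure mode: the more threads a job runs beyond its reservation, the larger the fraction of each period it spends completely stalled, and any request in flight during a stall eats the full delay.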

Most services at Twitter use thread pools that are much larger than their Mesos core reservation. Under heavy load, these jobs try to use more cores than they have reserved and get throttled. To avoid violating Service Level Objectives (SLOs), services respond by overprovisioning CPU: requesting more CPUs per shard than they need, or adding shards. A classic instance of this problem was the JVM garbage collector. Before the JVM became container aware, each JVM sized its parallel GC thread pool to the number of cores on the machine; during a garbage collection, those threads would all run simultaneously, rapidly exhausting the CPU quota and causing throttling. Although that particular issue has been fixed, the same problem persists at the application level for nearly every service running on Mesos at Twitter.
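The container-aware JVM fix amounted to sizing pools against the container's quota rather than the machine's core count, and applications can do the same. As a hedged sketch (not Twitter's actual mechanism), a Python service could read the cgroup v1 CFS bandwidth files and fall back to the host's logical core count when they are absent or unset:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def container_cpu_limit() -> int:
    """Best-effort average-core limit for this container, read from the
    standard cgroup v1 CFS bandwidth files; falls back to the host's
    logical core count when no quota is set (quota of -1) or the files
    are missing (e.g. cgroup v2 hosts, which use a different layout)."""
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0 and period > 0:
            return max(1, quota // period)
    except (OSError, ValueError):
        pass
    return os.cpu_count() or 1

# Size worker pools against the container's quota, not the machine:
pool = ThreadPoolExecutor(max_workers=container_cpu_limit())
```

Sizing every pool this way keeps a single pool from blowing through the quota by itself, though, as discussed below, it does not by itself solve the problem of multiple pools competing within one container.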

**Real-Life Issue: Case Study on the Largest Twitter Service**

In order to understand the practical implications of this problem, let’s examine “service-1,” which happens to be the largest and most expensive service at Twitter. The following CPU utilization histogram depicts the service reaching its failure point during a load test, right above its peak load capacity:

[Insert image: CPU Utilization Histogram]

Although the service is provisioned for 20 cores, actual utilization stays well below that threshold, even near peak load. The spikes above 20 cores are what pushed the job over its CPU quota and triggered throttling, causing a drastic increase in latency and violating Service Level Objectives (SLOs) even though average utilization was only about 8 cores, or 40% of quota. Note that the graph's sampling period was 10ms while the quota period was 100ms, so it is technically possible to see excursions above 20 cores without throttling; but when excursions are frequent, and especially when they go well above 20, throttling typically follows.
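The distinction between the 10ms sampling window and the 100ms quota period can be made precise: throttling depends on cumulative runtime within a quota period, not on any single instantaneous sample. A small sketch with hypothetical sample values for a 20-core reservation:

```python
# Hypothetical 10ms utilization samples (cores in use) across one 100ms
# quota period, for a container reserved 20 cores.
RESERVED_CORES = 20
PERIOD_MS = 100
SAMPLE_MS = 10
QUOTA = RESERVED_CORES * PERIOD_MS   # 2000 core-ms of runtime per period

def throttles(samples_cores) -> bool:
    """True if cumulative usage over the period exhausts the quota."""
    used = sum(c * SAMPLE_MS for c in samples_cores)  # core-ms consumed
    return used > QUOTA

# One brief excursion above 20 cores need not throttle:
print(throttles([8, 8, 8, 26, 8, 8, 8, 8, 8, 8]))   # False (980 core-ms)

# Sustained usage modestly above the reservation always throttles:
print(throttles([24] * 10))                          # True (2400 core-ms)
```

This is why the histogram's average of ~8 cores is compatible with SLO-violating throttling: it only takes a run of samples well above 20 within a single period to exhaust the budget, no matter how idle the container is on average.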

However, after adjusting the thread pool sizes to limit core usage and prevent throttling, the CPU utilization histogram during the load test changed significantly:

[Insert image: Adjusted CPU Utilization Histogram]

At 1.6 times the load (request rate) of the previous histogram, the load test harness could not push the load any higher to find service-1's true peak: the upstream service feeding it could not sustain a higher rate. Subsequent testing, however, showed that with the thread pool sizes fixed, the service could handle roughly twice its previous capacity. This is not an isolated result; Andy Wilcox found similar performance gains under load in service-2, for similar reasons. For services that prioritize latency, the same change can instead be taken as a latency win: in the case of service-1, keeping the provisioned capacity the same rather than halving it yielded a 20% reduction in latency. These improvements translate into substantial efficiency gains and cost reductions, amounting to a large annual sum for service-1 alone.

**The Widespread Impact: Thread Usage across Twitter’s Fleet**

Plotting the number of active threads against the number of reserved cores for moderately sized services (100 shards or more) shows that almost all services have significantly more threads that want to execute than reserved cores; tens of runnable threads per reserved core is not uncommon. By that standard, the earlier service-1 example is relatively modest at 1.5 to 2 runnable threads per reserved core under load. A widespread practice, both inside and outside Twitter, is to size thread pools at 2 times the number of logical cores on the machine. That recommendation comes from workloads such as a gcc compile, where the goal is to keep resources from idling. Applying it to Twitter's applications, however, presents several problems:

1. Most applications have multiple competing thread pools, making it difficult to determine an optimal thread pool size.
2. Exceeding the reserved core limit has severe consequences, leading to throttling and compromised performance.
3. Having additional threads devoted to computations can negatively impact latency.

The “2x number of logical cores” provisioning strategy should be revisited in the Twitter context.
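One direction a revised strategy could take, addressing the first problem above, is to treat the container's reservation as a budget and split it across the competing pools rather than sizing each pool independently. A hedged sketch, with hypothetical pool names and weights:

```python
def partition_cores(reserved_cores: int, weights: dict) -> dict:
    """Split a container's core reservation across competing thread pools
    in proportion to (hypothetical) importance weights, guaranteeing each
    pool at least 1 thread while keeping the total near the reservation."""
    total = sum(weights.values())
    return {name: max(1, round(reserved_cores * w / total))
            for name, w in weights.items()}

# Hypothetical pools for a 20-core reservation:
print(partition_cores(20, {"request_workers": 6, "serialization": 3, "gc": 1}))
# {'request_workers': 12, 'serialization': 6, 'gc': 2}
```

Because the pool sizes sum to roughly the reservation instead of each being sized at 2x the machine's logical cores, the container cannot have every pool simultaneously demanding the full machine, which is exactly the pattern that exhausts the quota and triggers throttling.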

**Concluding Remarks**

Efficiency optimization for CPU bound services at Twitter is a paramount concern. By understanding the underlying problems with service configurations and the Linux scheduler, Twitter can proactively address the limitations hampering CPU utilization. Case studies highlight the significant gains achieved through thread pool and core utilization adjustments, leading to improved performance and cost reductions. However, the challenge lies in scaling these optimizations across multiple services, as manual adjustments become impractical. Future exploration should focus on finding scalable solutions that can recapture efficiencies for the majority of services. By doing so, Twitter can maximize its service capacity, enhance user experience, and achieve substantial cost savings.
