Reducing Tail Latencies Through Environment- and Neighbour-aware Thread Management

Andrew Jeffery, Chris Jensen, Richard Mortier

TL;DR

The paper addresses tail latency in multi-tenant microservice deployments caused by CPU overcommitment, which arises from quota misdetection and from neighbour CPU usage. It analyzes how running more OS threads than there are available CPUs affects latency, and how different languages' CPU discovery interacts with CPU quotas. It introduces the friendlypool, a neighbour-aware threadpool that dynamically scales its number of active workers using a control thread and the ratio of cpu_time_self to cpu_time_all, with a tunable overcommitment factor. Empirically, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$, offering a practical path toward reducing tail latency on modern runtimes.
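The scaling idea in the summary above can be illustrated with a minimal sketch. The summary does not give the sizing rule itself, so the formula below is an assumption chosen for illustration, not the paper's equation: it treats cpu_time_self / cpu_time_all as this application's share of recent CPU time and scales the CPU count by that share and by the overcommitment factor. The function name `target_active_workers` and its parameters are hypothetical.

```python
import math


def target_active_workers(cpu_time_self, cpu_time_all, num_cpus, overcommit=1.0):
    """Hypothetical sizing rule for a neighbour-aware pool (an assumption,
    not the paper's formula).

    cpu_time_self: CPU time used by this application over the last interval.
    cpu_time_all:  CPU time used by all applications over the same interval.
    num_cpus:      CPUs available to this application.
    overcommit:    tunable factor; values above 1.0 allow more active workers.
    """
    if cpu_time_all <= 0:
        # Nothing has run yet, so assume no neighbour pressure.
        share = 1.0
    else:
        # Fraction of the consumed CPU time that belongs to this application.
        share = min(max(cpu_time_self, 0) / cpu_time_all, 1.0)
    target = math.floor(num_cpus * share * overcommit)
    # Always keep at least one worker active so the pool can make progress.
    return max(1, target)
```

In this sketch a control thread would call the function periodically with fresh CPU-time samples and park or wake workers to match the result. For example, an application that used half of the consumed CPU time on 8 CPUs would keep 4 workers active with `overcommit=1.0`, and 6 with `overcommit=1.5`.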

Abstract

Application tail latency is a key metric for many services, with high latencies being linked directly to loss of revenue. Modern deeply-nested micro-service architectures exacerbate tail latencies, increasing the likelihood of users experiencing them. In this work, we show how CPU overcommitment by OS threads leads to high tail latencies when applications are under heavy load. CPU overcommitment can arise from two operational factors: incorrectly determining the number of CPUs available when under a CPU quota, and the ignorance of neighbour applications and their CPU usage. We discuss different languages' solutions to obtaining the CPUs available, evaluating the impact, and discuss opportunities for a more unified language-independent interface to obtain the number of CPUs available. We then evaluate the impact of neighbour usage on tail latency and introduce a new neighbour-aware threadpool, the friendlypool, that dynamically avoids overcommitment. In our evaluation, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$.
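To make the first factor concrete, the sketch below shows one way a runtime might derive a CPU count that respects a quota. It assumes a Linux cgroup v2 environment, where the `cpu.max` file holds a quota and a period in microseconds (or `max` for no quota), and it assumes the file is visible at `/sys/fs/cgroup/cpu.max`, as is common inside containers. The function names are hypothetical, and rounding the quota up to a whole CPU is an illustrative choice rather than what any particular language runtime does.

```python
import math
import os


def cpus_from_cpu_max(cpu_max_text, host_cpus):
    """Derive a CPU count from the contents of a cgroup v2 cpu.max file.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no quota is set.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return host_cpus
    # Round up so a fractional quota (e.g. 1.5 CPUs) still gets 2 threads;
    # this rounding policy is an assumption made for illustration.
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(limit, host_cpus))


def available_cpus(path="/sys/fs/cgroup/cpu.max"):
    """Return a quota-aware CPU count, falling back to the host count."""
    host_cpus = os.cpu_count() or 1
    try:
        with open(path) as f:
            return cpus_from_cpu_max(f.read(), host_cpus)
    except (OSError, ValueError, ZeroDivisionError):
        # No cgroup v2 file, or unexpected contents: use the host count.
        return host_cpus
```

A threadpool sized with `os.cpu_count()` alone on an 8-CPU host would start 8 threads even under a quota of `200000 100000` (2 CPUs), which is the kind of overcommitment the abstract describes; the sketch above would return 2 in that case.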

Paper Structure (15 sections, 1 equation, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Structure of the applications and the latencies being measured.
  • Figure 2: Overall latency and throughput for varying numbers of OS threads in Rust. Workers have no contention.
  • Figure 3: Overall latency and throughput for varying levels of OS thread overcommitment in Rust. Workers contend over the fib computation, starting with a lock at fib(30).
  • Figure 4: Example schedulings, shown as scheduling periods, of 2 apps with CPU quotas equivalent to 1 CPU on a 2-CPU system.
  • Figure 5: Impact of using an incorrect OS thread count when under a CPU quota in Go.
  • ...and 3 more figures