Reducing Tail Latencies Through Environment- and Neighbour-aware Thread Management

Andrew Jeffery; Chris Jensen; Richard Mortier

TL;DR

The paper addresses tail latency in multi-tenant microservice deployments caused by CPU overcommitment, which arises from misdetected CPU quotas and from neighbours' CPU usage. It analyzes how CPU overcommitment by OS threads leads to high tail latencies and how different languages' CPU discovery interacts with quotas. It introduces the friendlypool, a neighbour-aware threadpool whose control thread dynamically scales the number of active workers using the ratio of cpu_time_self to cpu_time_all and a tunable overcommitment factor. Empirically, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$, offering a practical path toward lower tail latency on modern runtimes.
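To make the sizing idea concrete, here is a minimal sketch of how a control thread might turn the ratio of cpu_time_self to cpu_time_all into a worker count. The summary does not give the paper's actual formula, so the function name `target_active_workers`, the multiplication by the CPU count, the rounding down, and the clamping are all assumptions made for illustration.

```python
def target_active_workers(cpu_time_self, cpu_time_all, num_cpus,
                          max_workers, overcommit_factor=1.0):
    """Hypothetical sizing rule, NOT the paper's formula.

    cpu_time_self: CPU time consumed by this application in the last interval.
    cpu_time_all:  CPU time consumed by all applications in the same interval.
    """
    if cpu_time_all <= 0:
        # Nothing measured yet: assume one worker per CPU (an assumption).
        return max(1, min(max_workers, num_cpus))
    # Fraction of the machine's CPU activity attributable to this application.
    share = min(cpu_time_self / cpu_time_all, 1.0)
    # Scale the CPU count by that share, then by the tunable factor.
    target = int(num_cpus * share * overcommit_factor)
    # Always keep at least one worker and never exceed the pool size.
    return max(1, min(max_workers, target))
```

Under these assumptions, an application responsible for half of the CPU time on an 8-CPU machine would run 4 active workers with a factor of 1.0, and 6 with a factor of 1.5. A control thread would re-evaluate this periodically and park or wake workers to match.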

Abstract

Application tail latency is a key metric for many services, with high latencies being linked directly to loss of revenue. Modern deeply-nested micro-service architectures exacerbate tail latencies, increasing the likelihood of users experiencing them. In this work, we show how CPU overcommitment by OS threads leads to high tail latencies when applications are under heavy load. CPU overcommitment can arise from two operational factors: incorrectly determining the number of CPUs available when under a CPU quota, and the ignorance of neighbour applications and their CPU usage. We discuss different languages' solutions to obtaining the CPUs available, evaluating the impact, and discuss opportunities for a more unified language-independent interface to obtain the number of CPUs available. We then evaluate the impact of neighbour usage on tail latency and introduce a new neighbour-aware threadpool, the friendlypool, that dynamically avoids overcommitment. In our evaluation, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$.
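The quota misdetection described above can be illustrated with a short sketch that derives a CPU count from a cgroup v2 `cpu.max` file, which holds a quota and a period in microseconds, or `max` when no quota is set. The helper names, the default path `/sys/fs/cgroup/cpu.max` (typical inside a container, but not guaranteed), and the choice to round fractional quotas up are assumptions for illustration and are not taken from the paper.

```python
import math
import os

def cpus_from_cpu_max(cpu_max_text, host_cpus):
    """Derive a CPU count from the contents of a cgroup v2 cpu.max file."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        # No quota configured: every host CPU is usable.
        return host_cpus
    # Round a fractional quota up (an assumption; rounding down is also defensible).
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(host_cpus, limit))

def available_cpus(path="/sys/fs/cgroup/cpu.max"):
    """Hypothetical helper: quota-aware CPU count, falling back to the host count."""
    host_cpus = os.cpu_count() or 1
    try:
        with open(path) as f:
            return cpus_from_cpu_max(f.read(), host_cpus)
    except (OSError, ValueError):
        # File missing or unparsable (e.g. cgroup v1 or a non-Linux host).
        return host_cpus
```

A threadpool sized with `available_cpus()` rather than the raw host CPU count would avoid starting, say, 8 OS threads under a quota equivalent to 2 CPUs.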

Paper Structure (15 sections, 1 equation, 8 figures, 2 tables)

Introduction
Overcommitment: OS Threads > CPUs
Computation Model
Overcommitment
How Do Applications Tune Themselves?
Working Under CPU Quotas
Getting the Correct CPU Count
Capitalising On Spare Resources
CPU Quotas Are Wasteful
Enabling Bursts
The Cost Of Ignoring Your Neighbours
Noisy Neighbourhood
Dynamically Adapting To Neighbours
Related Work
Conclusion

Figures (8)

Figure 1: Structure of the applications and the latencies being measured.
Figure 2: Overall latency and throughput at various amounts of OS threads in Rust. Workers have no contention.
Figure 3: Overall latency and throughput at various amounts of OS thread overcommitment in Rust. Workers have contention over the fib computation starting with a lock at fib(30).
Figure 4: Example schedulings, shown in scheduling periods of 2 apps with CPU quotas equivalent to 1 CPU on a 2 CPU system.
Figure 5: Impact of using an incorrect OS thread count when under a CPU quota in Go.
...and 3 more figures
