Reducing Tail Latencies Through Environment- and Neighbour-aware Thread Management
Andrew Jeffery, Chris Jensen, Richard Mortier
TL;DR
The paper addresses tail latency in multi-tenant microservice deployments caused by CPU overcommitment, which arises from misdetected CPU quotas and from neighbours' CPU usage. It analyzes how OS threads come to exceed the available CPUs and how different languages' CPU discovery interacts with quotas. It introduces the friendlypool, a neighbour-aware threadpool that dynamically scales the number of active workers via a control thread, using the ratio of cpu_time_self to cpu_time_all and a tunable overcommitment factor. Empirically, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$, offering a practical path toward reducing tail latency on modern runtimes.
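To make the scaling idea concrete, the following is a minimal sketch of how a control thread might turn the ratio of cpu_time_self to cpu_time_all into a target number of active workers. The formula, the function name `target_active_workers`, and the parameters `num_cpus`, `overcommit`, and `max_workers` are all assumptions made for illustration; this summary does not state the exact rule the friendlypool uses.

```python
import math


def target_active_workers(cpu_time_self, cpu_time_all, num_cpus,
                          overcommit=1.0, max_workers=None):
    """Hypothetical sizing rule, NOT the paper's formula.

    Treats cpu_time_self / cpu_time_all as this process's share of the
    machine's busy CPU time and scales the CPU count by that share and
    by a tunable overcommitment factor.
    """
    if max_workers is None:
        max_workers = num_cpus
    if cpu_time_all <= 0:
        # No CPU usage observed yet: assume no neighbours are competing.
        return max_workers
    share = min(cpu_time_self / cpu_time_all, 1.0)
    target = math.floor(num_cpus * share * overcommit)
    # Always keep at least one worker active so the pool makes progress.
    return max(1, min(max_workers, target))
```

Under these assumptions, a process responsible for half of the observed CPU time on an 8-CPU machine would keep 4 workers active, and raising `overcommit` would let it keep more workers active at the risk of contending with its neighbours.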
Abstract
Application tail latency is a key metric for many services, with high latencies being linked directly to loss of revenue. Modern deeply-nested micro-service architectures exacerbate tail latencies, increasing the likelihood of users experiencing them. In this work, we show how CPU overcommitment by OS threads leads to high tail latencies when applications are under heavy load. CPU overcommitment can arise from two operational factors: incorrectly determining the number of CPUs available when under a CPU quota, and the ignorance of neighbour applications and their CPU usage. We discuss different languages' solutions to obtaining the CPUs available, evaluating the impact, and discuss opportunities for a more unified language-independent interface to obtain the number of CPUs available. We then evaluate the impact of neighbour usage on tail latency and introduce a new neighbour-aware threadpool, the friendlypool, that dynamically avoids overcommitment. In our evaluation, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$.
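As an illustration of the first factor, the sketch below derives a CPU count from a cgroup v2 CPU quota. It relies on the Linux `cpu.max` file, which holds a quota and a period in microseconds (or the literal `max` when unlimited). The path assumes the cgroup v2 unified hierarchy is mounted at `/sys/fs/cgroup`, and rounding the quota up to a whole CPU is a choice made here, not a recommendation taken from the paper. The helper names are hypothetical.

```python
import math
import os
from pathlib import Path

# Assumed location of the quota file under the cgroup v2 unified hierarchy.
CPU_MAX_PATH = Path("/sys/fs/cgroup/cpu.max")


def cpus_from_cpu_max(contents, host_cpus):
    """Convert the contents of cpu.max ("<quota> <period>") to a CPU count."""
    quota, period = contents.split()
    if quota == "max":
        # No quota set: every host CPU is available.
        return host_cpus
    # Round fractional quotas up (an assumption), but never exceed the host.
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(host_cpus, limit))


def available_cpus():
    """Quota-aware CPU count, falling back to the host count on any error."""
    host_cpus = os.cpu_count() or 1
    try:
        return cpus_from_cpu_max(CPU_MAX_PATH.read_text(), host_cpus)
    except (OSError, ValueError, ZeroDivisionError):
        # File missing (e.g. cgroup v1 or non-Linux) or malformed contents.
        return host_cpus
```

A threadpool sized with `os.cpu_count()` alone would see every logical CPU on the host, because that call does not take the cgroup quota into account; sizing with `available_cpus()` instead keeps the worker count within the quota.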
