DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

Ying Yuan, Pengfei Zuo, Bo Wang, Zhangyu Chen, Zhipeng Tan, Zhou Yu

TL;DR

This work addresses the core conflict in distributed LLM serving between maximizing KV cache affinity and achieving balanced load under strict TTFT SLOs. It introduces DualMap, a dual-mapping scheduler that assigns each request to two candidate instances via independent hash functions on the request prefix, then selects the better candidate using SLO-aware routing inspired by the power of two choices. DualMap further mitigates real-world hotspots with hotspot-aware rebalancing and enables elastic scaling through a lightweight dual-hash-ring design that minimizes remapping. Experiments on real workloads with Qwen-based models show substantial gains, notably up to 2.25× higher effective capacity under the same TTFT SLO and significant reductions in tail TTFT, demonstrating practical impact for scalable, cache-efficient LLM serving.

Abstract

In LLM serving, reusing the KV cache of prompts across requests is critical for reducing TTFT and serving costs. Cache-affinity scheduling, which co-locates requests with the same prompt prefix to maximize KV cache reuse, often conflicts with load-balancing scheduling that distributes requests evenly across compute instances. Existing schedulers fail to reconcile this trade-off as they operate within a single mapping space, typically applying cache-affinity routing to a subset of requests and load-balanced routing to the rest, without a unified solution to achieve both goals. To address this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that achieves both cache affinity and load balancing. Its key idea is to map each request to two candidate instances via two independent hash functions based on the request prompt, then intelligently select the better candidate based on current system states. This design increases the likelihood that requests with shared prefixes are co-located, while evenly dispersing distinct prefixes across the cluster via "the power of two choices". To make DualMap robust under dynamic and skewed real-world workloads, we incorporate three techniques: 1) SLO-aware request routing, which prioritizes cache affinity but switches to load-aware scheduling when TTFT exceeds the SLO, enhancing load balance without sacrificing cache reuse; 2) hotspot-aware rebalancing, which dynamically migrates requests from overloaded to underloaded instances, mitigating hotspots and rebalancing the system; 3) lightweight dual-hash-ring scaling, which leverages a dual-hash-ring mapping to support fast and low-overhead instance scaling without costly global remapping. Experiments on real-world workloads show that DualMap improves effective request capacity by up to 2.25× under the same TTFT SLO constraints compared with SOTA work.
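
The dual-mapping idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `HashRing` and `route`, the salted SHA-256 hashing, the virtual-node count, and the `cached` / `est_ttft` inputs are all assumptions made for illustration. The sketch places each prefix on two consistent-hash rings, prefers a candidate that already holds the prefix as long as its estimated TTFT stays within the SLO, and otherwise falls back to the less loaded of the two candidates. Hotspot-aware rebalancing is not modeled.

```python
import bisect
import hashlib


def _hash(key: str, salt: str) -> int:
    """Stable 64-bit hash; different salts act as independent hash functions."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, salt: str, vnodes: int = 64):
        self.salt = salt
        self.vnodes = vnodes
        self._points = []  # sorted list of (hash, instance) pairs

    def add(self, instance: str) -> None:
        # Each instance owns several points, which evens out the key space.
        for i in range(self.vnodes):
            point = (_hash(f"{instance}#{i}", self.salt), instance)
            bisect.insort(self._points, point)

    def lookup(self, prefix: str) -> str:
        # First point clockwise from the prefix hash, wrapping at the end.
        h = _hash(prefix, self.salt)
        idx = bisect.bisect_left(self._points, (h, ""))
        return self._points[idx % len(self._points)][1]


def route(prefix, rings, cached, est_ttft, slo):
    """Choose between the two candidates for a request prefix.

    rings:    pair of HashRing objects (one per hash function)
    cached:   dict mapping instance -> set of prefixes assumed to be cached
    est_ttft: dict mapping instance -> estimated TTFT in seconds (assumed given)
    slo:      TTFT target in seconds
    """
    candidates = [rings[0].lookup(prefix), rings[1].lookup(prefix)]
    # Cache affinity first: reuse a cached prefix if the SLO is still met.
    for c in candidates:
        if prefix in cached.get(c, set()) and est_ttft[c] <= slo:
            return c
    # Otherwise act load-aware: take the candidate with the lower estimate.
    return min(candidates, key=lambda c: est_ttft[c])
```

In this sketch, adding an instance to a ring only moves the prefixes captured by the new instance's points, which is the property that lets a ring-based mapping scale without global remapping.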

Paper Structure

This paper contains 36 sections, 10 equations, 15 figures, and 1 table.

Figures (15)

  • Figure 1: Pareto trade-off between cache hit rate and load balance ratio (coefficient of variation, CV) across different scheduling strategies on the Conversation and Tool&Agent datasets. A lower CV indicates more even load distribution across instances.
  • Figure 2: The system overview of DualMap.
  • Figure 3: Effective request capacity and goodput of different scheduling strategies.
  • Figure 4: TTFT and E2E latency of different scheduling strategies.
  • Figure 5: Ablation results under the Conversation workload using the Qwen2.5-14B model.
  • ...and 10 more figures