DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

Ying Yuan; Pengfei Zuo; Bo Wang; Zhangyu Chen; Zhipeng Tan; Zhou Yu

TL;DR

This work addresses the core conflict in distributed LLM serving between maximizing KV cache affinity and achieving balanced load under strict TTFT SLOs. It introduces DualMap, a dual-mapping scheduler that assigns each request to two candidate instances via independent hash functions on the request prefix, then selects the better candidate using SLO-aware routing inspired by the power of two choices. DualMap further mitigates real-world hotspots with hotspot-aware rebalancing and enables elastic scaling through a lightweight dual-hash-ring design that minimizes remapping. Experiments on real workloads with Qwen-based models show substantial gains, notably up to 2.25× higher effective capacity under the same TTFT SLO and significant reductions in tail TTFT, demonstrating practical impact for scalable, cache-efficient LLM serving.

Abstract

In LLM serving, reusing the KV cache of prompts across requests is critical for reducing time to first token (TTFT) and serving costs. Cache-affinity scheduling, which co-locates requests with the same prompt prefix to maximize KV cache reuse, often conflicts with load-balancing scheduling, which distributes requests evenly across compute instances. Existing schedulers fail to reconcile this trade-off because they operate within a single mapping space, typically applying cache-affinity routing to a subset of requests and load-balanced routing to the rest, without a unified solution that achieves both goals. To address this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that achieves both cache affinity and load balancing. Its key idea is to map each request to two candidate instances via two independent hash functions based on the request prompt, then select the better candidate based on the current system state. This design increases the likelihood that requests with shared prefixes are co-located, while evenly dispersing distinct prefixes across the cluster via "the power of two choices". To make DualMap robust under dynamic and skewed real-world workloads, we incorporate three techniques: 1) SLO-aware request routing, which prioritizes cache affinity but switches to load-aware scheduling when TTFT exceeds the SLO, enhancing load balance without sacrificing cache reuse; 2) hotspot-aware rebalancing, which dynamically migrates requests from overloaded to underloaded instances, mitigating hotspots and rebalancing the system; 3) lightweight dual-hash-ring scaling, which leverages a dual-hash-ring mapping to support fast and low-overhead instance scaling without costly global remapping. Experiments on real-world workloads show that DualMap improves effective request capacity by up to 2.25× under the same TTFT SLO constraints compared with state-of-the-art work.
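The dual-mapping idea and the SLO-aware switch described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the salted SHA-256 hashes, the `Instance` fields, the linear `estimate_ttft` model, the `tokens_per_second` rate, and the rule that bumps the second candidate when both hashes collide are all assumptions made for illustration. The sketch also treats the whole prompt as the shared prefix.

```python
import hashlib
from dataclasses import dataclass, field


def _hash(key: str, salt: str) -> int:
    """Salted 64-bit hash; two salts stand in for two independent hash functions."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


@dataclass
class Instance:
    name: str
    queued_tokens: int = 0  # hypothetical load metric: tokens waiting for prefill
    cached_prefixes: set = field(default_factory=set)


def candidates(prefix: str, instances: list) -> tuple:
    """Map a prompt prefix to two candidate instances."""
    n = len(instances)
    first = _hash(prefix, "h1") % n
    second = _hash(prefix, "h2") % n
    if second == first and n > 1:
        # Assumption: force two distinct candidates when the hashes collide.
        second = (second + 1) % n
    return instances[first], instances[second]


def estimate_ttft(inst: Instance, prefix: str, prompt_tokens: int,
                  tokens_per_second: float = 1000.0) -> float:
    """Hypothetical linear TTFT model: queued work plus uncached prompt tokens."""
    new_tokens = 0 if prefix in inst.cached_prefixes else prompt_tokens
    return (inst.queued_tokens + new_tokens) / tokens_per_second


def route(prefix: str, prompt_tokens: int, instances: list,
          slo_seconds: float) -> Instance:
    """Prefer cache affinity; fall back to the faster candidate past the SLO."""
    a, b = candidates(prefix, instances)
    # Cache affinity first: keep `a` unless only `b` already holds the prefix.
    if prefix in a.cached_prefixes or prefix not in b.cached_prefixes:
        preferred, other = a, b
    else:
        preferred, other = b, a
    chosen = preferred
    if estimate_ttft(preferred, prefix, prompt_tokens) > slo_seconds:
        # Load-aware fallback between the two candidates only.
        chosen = min((preferred, other),
                     key=lambda i: estimate_ttft(i, prefix, prompt_tokens))
    # Book-keeping: account for the new prefill work and the cached prefix.
    if prefix not in chosen.cached_prefixes:
        chosen.queued_tokens += prompt_tokens
        chosen.cached_prefixes.add(prefix)
    return chosen
```

Because the fallback only ever considers the same two hashed candidates, a prefix that spills over from an overloaded instance lands on one predictable alternative, so later requests with that prefix can still hit a warm cache there.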
