TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

Zhuohang Bian, Feiyang Wu, Chengrui Zhang, Hangcheng Dong, Yun Liang, Youwei Zhuo

Abstract

Multi-agent LLM applications organize execution in synchronized rounds where a central scheduler gathers outputs from all agents and redistributes the combined context. This All-Gather communication pattern creates massive KV Cache redundancy, because every agent's prompt contains the same shared output blocks, yet existing reuse methods fail to exploit it efficiently. We present TokenDance, a system that scales the number of concurrent agents by exploiting the All-Gather pattern for collective KV Cache sharing. TokenDance's KV Collector performs KV Cache reuse over the full round in one collective step, so the cost of reusing a shared block is paid once regardless of agent count. Its Diff-Aware Storage encodes sibling caches as block-sparse diffs against a single master copy, achieving 11-17x compression on representative workloads. Evaluation on GenerativeAgents and AgentSociety shows that TokenDance supports up to 2.7x more concurrent agents than vLLM with prefix caching under SLO requirements, reduces per-agent KV Cache storage by up to 17.5x, and achieves up to 1.9x prefill speedup over per-request position-independent caching.
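The All-Gather round structure described above can be made concrete with a small sketch. The function and variable names below (`run_round`, `histories`, `outputs`) are illustrative, not TokenDance's actual interface; the point is only that every agent's next prompt embeds the same shared output blocks, so the shared content is duplicated once per agent.

```python
def run_round(histories, outputs):
    """Build each agent's next prompt: its own private history followed by
    the concatenated outputs of all agents from the previous round."""
    shared = list(outputs)  # the same shared blocks go to every agent
    return [[h] + shared for h in histories]

histories = [f"H{i}" for i in range(4)]  # per-agent private history blocks
outputs = [f"O{i}" for i in range(4)]    # last round's agent outputs

prompts = run_round(histories, outputs)
# Each of the 4 prompts repeats all 4 shared blocks: 16 shared-block copies
# per round, although only 4 distinct shared blocks exist.
shared_copies = sum(p.count(o) for p in prompts for o in outputs)
```

Under per-request caching, each of those 16 copies is reused (or recomputed) independently; TokenDance's collective reuse pays the shared-block cost once for the whole round.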

Paper Structure

This paper contains 21 sections, 2 equations, 14 figures, and 1 algorithm.

Figures (14)

  • Figure 1: The All-Gather prompt structure. All agents receive the same output blocks ($O$), but the blocks appear at different positions because each prompt has its own private history ($H$) and may use a different block order. This structure arises in any multi-agent application that follows the All-Gather pattern.
  • Figure 2: The scaling gap between multi-agent and independent workloads on a single A100-80GB GPU serving Qwen2.5-14B. Both workloads issue the same total number of subrequests (250), but the multi-agent workload nearly exhausts the KV Cache pool because each agent retains its own copy of the shared context across rounds, whereas independent requests free memory after completion.
  • Figure 3: High similarity of KV Caches after PIC reuse. Because all agents reuse the same shared blocks, their KV Caches differ only at the privately recomputed positions.
  • Figure 4: Per-request PIC reuse (top) vs. TokenDance's collective reuse (bottom). Existing PIC methods process each agent's shared blocks independently, repeating RoPE rotation and important-position selection $N$ times. TokenDance groups the $N$ requests and performs these operations once for the round.
  • Figure 5: TokenDance Overview. A round-aware prompt interface preserves block boundaries so the runtime can identify shared content; collective KV Cache reuse amortizes the reuse cost across all agents in the round; diff-aware storage with fused restore compresses per-agent KV Caches to only the inter-agent differences.
  • ...and 9 more figures
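The diff-aware storage idea from the abstract and Figure 5 can be sketched in a few lines: keep one master KV copy and store each sibling cache only as the blocks where it differs. The block size, array shapes, and `encode_diff`/`restore` helpers below are toy assumptions for illustration, not the paper's implementation (which operates on real KV tensors with fused restore).

```python
import numpy as np

BLOCK = 4  # tokens per KV block (toy value)

def encode_diff(master, sibling):
    """Return {block_index: block} for blocks where sibling differs from master."""
    diff = {}
    for b in range(len(master) // BLOCK):
        s = slice(b * BLOCK, (b + 1) * BLOCK)
        if not np.array_equal(master[s], sibling[s]):
            diff[b] = sibling[s].copy()
    return diff

def restore(master, diff):
    """Rebuild a sibling cache from the master copy plus its block-sparse diff."""
    out = master.copy()
    for b, block in diff.items():
        out[b * BLOCK:(b + 1) * BLOCK] = block
    return out

rng = np.random.default_rng(0)
master = rng.standard_normal((32, 8))  # 8 blocks of 4 "tokens", hidden dim 8
sibling = master.copy()
sibling[0:BLOCK] += 1.0                # only the first (private) block differs

diff = encode_diff(master, sibling)    # stores 1 block instead of 8
rebuilt = restore(master, diff)
```

Because sibling caches after PIC reuse differ from the master only at privately recomputed positions (Figure 3), the diff covers a small fraction of blocks, which is the source of the 11-17x compression reported above.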