DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse

TL;DR

DroidSpeak reuses KV caches across different LLMs that share the same architecture in order to accelerate distributed multi-LLM inference. It takes a two-stage approach: offline profiling identifies contiguous groups of critical layers to recompute, and online pipelined loading hides the transfer of KV and embedding (E) caches, with configurations selected to meet latency SLOs. Empirical results show up to 4× higher throughput and about 3.1× faster prefill with negligible quality loss across eight model pairs and six datasets, outperforming full prefill, full KV reuse, and CacheBlend baselines. The work demonstrates a practical path to efficient cross-LLM serving in enterprise settings, while noting limitations around cross-foundation sharing and sensitivity to data drift.
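The configuration-selection step summarized above can be illustrated with a minimal sketch. It assumes that offline profiling has already produced, for each candidate contiguous layer range, a quality score and a prefill latency. The `Config` type, the `pick_config` helper, and all numbers below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """One profiled option: recompute layers [start, end), reuse the rest."""
    start: int
    end: int
    quality: float     # hypothetical profiled quality score (higher is better)
    latency_ms: float  # hypothetical profiled prefill latency


def pick_config(configs: Sequence[Config], slo_ms: float) -> Optional[Config]:
    """Return the highest-quality config whose latency meets the SLO.

    Ties on quality are broken in favor of lower latency.
    Returns None when no profiled config fits the budget.
    """
    feasible = [c for c in configs if c.latency_ms <= slo_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c.quality, -c.latency_ms))


# Made-up profile for a 32-layer model (illustrative values only).
profiled = [
    Config(0, 0, quality=0.61, latency_ms=40.0),    # reuse every layer
    Config(12, 20, quality=0.83, latency_ms=95.0),
    Config(8, 24, quality=0.88, latency_ms=150.0),
    Config(0, 32, quality=0.90, latency_ms=300.0),  # recompute every layer
]
best = pick_config(profiled, slo_ms=160.0)
```

Restricting candidates to contiguous ranges keeps the search space small: an L-layer model has on the order of L² ranges rather than 2^L arbitrary layer subsets.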

Abstract

Compound AI systems, such as agentic systems, are an emerging trend in large-scale enterprise settings, with multiple LLMs specialized for different users, tasks, and/or roles working together. In these scenarios, different models often process inputs that share the same context prefix. Although much prior work has enabled the reuse of prefix KV caches across inputs for a single model, how to enable one model to reuse the prefix KV caches of a different model remains an open question. We introduce DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes running inference of different LLMs, so long as the LLMs have the same architecture. We present the first study that aims at understanding the impact of sharing KV caches across different LLMs, and if and when such sharing affects quality. Inspired by the findings, we present DroidSpeak, which selectively recomputes a few layers of the KV cache produced by another LLM and reuses the remaining layers, with negligible quality loss. Moreover, carefully pipelining the layer-wise recomputation and the loading of the reused KV cache further improves inference performance. Experiments on diverse datasets and model pairs demonstrate that DroidSpeak achieves up to 4× throughput improvement and about 3.1× faster prefill (time to first token), with negligible loss of quality in F1 score, ROUGE-L, or code similarity score, compared to a baseline that does not allow any sharing across models.
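To see why overlapping recomputation with cache loading helps, consider a rough two-stream timing model. This is a sketch under stated assumptions, not the paper's scheduler: it assumes the E cache needed by the first recomputed layer is transferred first, after which the reused KV layers are loaded back-to-back on a transfer stream while the recomputed layers run on a compute stream. The function names and all timings are hypothetical.

```python
from typing import Sequence


def sequential_ms(e_load: int, kv_loads: Sequence[int], computes: Sequence[int]) -> int:
    """Load everything first, then recompute: no overlap at all."""
    return e_load + sum(kv_loads) + sum(computes)


def pipelined_ms(e_load: int, kv_loads: Sequence[int], computes: Sequence[int]) -> int:
    """Overlap KV loading with recomputation once the E cache has arrived.

    Assumed model: both streams start right after the E cache transfer and
    prefill finishes when the slower of the two streams is done.
    """
    load_done = e_load + sum(kv_loads)
    compute_done = e_load + sum(computes)
    return max(load_done, compute_done)


# Made-up per-layer costs in milliseconds for a 32-layer model:
# 24 layers reused (loaded), 8 layers recomputed.
e_load = 5
kv_loads = [4] * 24
computes = [10] * 8

no_overlap = sequential_ms(e_load, kv_loads, computes)
overlap = pipelined_ms(e_load, kv_loads, computes)
print(no_overlap, overlap)
```

Under this model the pipelined latency is bounded by the slower stream rather than the sum of both, so loading is fully hidden whenever the total transfer time does not exceed the total recomputation time.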

Paper Structure

This paper contains 27 sections, 21 figures, and 2 tables.

Figures (21)

  • Figure 1: Various scenarios in which the same context is shared by multiple LLMs. DroidSpeak brings down the computation latency by up to 3.1× and increases throughput by 4×.
  • Figure 2: Illustration of the use of embedding (E), query (Q), key (K), and value (V) tensors in self-attention in transformer-based LLMs.
  • Figure 3: Prefill and decode phases.
  • Figure 4: Fine-tuned model gives higher accuracy than baseline.
  • Figure 5: Shorter input leads to smaller end-to-end time.
  • ...and 16 more figures