CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands

Weiye Wang, Chen Chen, Junxue Zhang, Zhusheng Wang, Hui Yuan, Zixuan Guan, Xiaolong Zheng, Qizhen Weng, Yin Chen, Minyi Guo

Abstract

Distributed prefix caching has become a core technique for efficient LLM serving. For long-context requests with high cache hit ratios, however, retrieving reusable KVCache blocks from remote servers has emerged as a new performance bottleneck. Such network-intensive LLM inference is expected to become increasingly common as agentic AI workloads continue to grow. Yet existing LLM inference engines remain largely compute-centric: they treat KVCache loading as a subordinate phase of GPU execution and often fail to account for its delay explicitly during scheduling. We present CALVO, an LLM serving engine that treats KVCache loading as a first-class concern. CALVO decouples KVCache loading and GPU computation into independently managed, asynchronously progressing stages, enabling better utilization of network, PCIe, and computation resources. In addition, CALVO incorporates KVCache loading delay as an explicit component of per-request service cost, leading to more accurate scheduling decisions. Experiments on a real testbed with diverse long-context workloads show that CALVO substantially improves the efficiency of network-intensive LLM inference, achieving up to 61.67% higher SLO attainment than the baseline.
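
The scheduling argument in the abstract can be illustrated with a toy calculation. The sketch below is not CALVO's scheduler: it assumes a deliberately simplified serial model in which each request first loads its KVCache and then runs GPU computation, and all names and timings (`Request`, `load_s`, `compute_s`, the two example requests) are hypothetical. It only shows why ranking requests by compute time alone can pick a worse order than ranking by loading plus compute time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    load_s: float     # hypothetical KVCache loading time (seconds)
    compute_s: float  # hypothetical GPU computation time (seconds)


def average_ttft(order):
    """Average time-to-first-token when requests are served one after another.

    Assumption: a request's TTFT is the finish time of its load + compute,
    and the next request starts only after the previous one finishes.
    """
    clock = 0.0
    total = 0.0
    for req in order:
        clock += req.load_s + req.compute_s
        total += clock
    return total / len(order)


# Two made-up network-intensive requests, listed in arrival order.
requests = [
    Request("A", load_s=8.0, compute_s=1.0),
    Request("B", load_s=1.0, compute_s=2.0),
]

fifo = list(requests)
# Compute-based SJF: the service cost ignores loading delay.
compute_sjf = sorted(requests, key=lambda r: r.compute_s)
# Load-aware SJF: the service cost includes loading delay.
load_aware_sjf = sorted(requests, key=lambda r: r.load_s + r.compute_s)

print(average_ttft(fifo))            # 10.5
print(average_ttft(compute_sjf))     # 10.5
print(average_ttft(load_aware_sjf))  # 7.5
```

With these made-up numbers, FIFO and compute-based SJF both serve A first because its compute time is shorter, even though its loading delay dominates. Counting the loading delay in the cost serves B first and lowers the average TTFT.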

Paper Structure (15 sections, 11 figures, 1 table)

Figures (11)

  • Figure 1: Workflow in a typical LLM inference engine (e.g., vLLM) when integrated with a distributed KVCache pooling framework (e.g., LMCache).
  • Figure 2: Breakdown of TTFT when a Llama-3.1-8B model serves requests with varying context-token lengths (loaded from a remote server) and a fixed query-token length. The figure shows that KVCache reuse effectively reduces TTFT, but KVCache loading then becomes a major bottleneck.
  • Figure 3: Per-stage processing throughput in vLLM-LMCache is suboptimal (the throughput of our later proposed CALVO system is shown for comparison).
  • Figure 4: Given two network-intensive requests, FIFO or compute-based SJF may prolong the average TTFT.
  • Figure 5: Workflow of CALVO.
  • ...and 6 more figures