ESS: An Offload-Centric Latent-Cache Management Architecture for DeepSeek-V3.2-Exp

Xinhang Chen, Chao Zhang, Jiahuan He, Wei Liu, Jianming Zhang, Wenlong Zhou, Xiao Li, Pai Zeng, Shiyong Li, Yuanpan Qian, Dong Li, Zhaogeng Li

TL;DR

This work tackles the Decode-stage bottleneck in long-context LLM serving by introducing ESS, an offload-centric architecture that moves the Latent-Cache to CPU memory while keeping latency-critical operations on the GPU. ESS combines Unified Virtual Addressing, FlashTrans for efficient small-block transfers, LRU-based cache management, warmup, and compute–communication overlap to decouple batch-size growth from fixed GPU memory capacity and raise Decode throughput. A high-fidelity simulator reports throughput gains of 69.4% at 32K context and 123% at 128K context, indicating that the method scales to long-context inference. The work offers a practical blueprint for decoupling batch size from memory constraints in real-world LLM serving, with paths for integration and for future enhancements such as combining ESS with lossy KV-cache compression.
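As a rough illustration of the LRU-based cache management mentioned above, the sketch below keeps a fixed number of cache entries in a "resident" pool and fetches the rest from a backing store on a miss. It is a minimal sketch under stated assumptions, not the ESS implementation: the class name `LRUResidentPool`, the dictionary standing in for CPU memory, and the string payloads are all hypothetical, and no real GPU transfer takes place.

```python
from collections import OrderedDict


class LRUResidentPool:
    """Toy LRU pool: `resident` stands in for GPU memory, `backing_store` for CPU memory."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity            # max entries kept resident (hypothetical knob)
        self.backing_store = backing_store  # full cache, assumed to live in CPU memory
        self.resident = OrderedDict()       # least recently used entry is first
        self.hits = 0
        self.misses = 0

    def get(self, token_id):
        if token_id in self.resident:
            # Hit: mark the entry as most recently used.
            self.resident.move_to_end(token_id)
            self.hits += 1
            return self.resident[token_id]
        # Miss: fetch from the backing store, evicting the LRU entry if the pool is full.
        self.misses += 1
        entry = self.backing_store[token_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[token_id] = entry
        return entry


# Placeholder data: eight cache entries, only two of which may be resident at once.
cpu_store = {i: f"latent-{i}" for i in range(8)}
pool = LRUResidentPool(capacity=2, backing_store=cpu_store)
for token_id in (0, 1, 0, 2, 1):
    pool.get(token_id)
print(pool.hits, pool.misses, list(pool.resident))
```

In this toy access pattern, re-reading entry 0 keeps it resident, so entry 1 is the one evicted when entry 2 arrives. The miss count is the quantity an offload design has to keep low, since each miss implies a CPU-to-GPU transfer.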

Abstract

DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although overall throughput has improved greatly, the Decode stage under PD (Prefill-Decode) disaggregation remains a major bottleneck. The bottleneck stems primarily from the conflict between the linear growth of the Latent-Cache with sequence length and limited GPU memory capacity, which constrains the feasible batch size and thereby suppresses Decode-stage throughput. To address this challenge, we propose ESS (Extended Sparse Server), an offload-centric system design tailored for DeepSeek-V3.2-Exp. ESS selectively offloads the Latent-Cache to CPU memory while keeping latency-critical components on the GPU. By freeing GPU memory, ESS effectively decouples batch-size scaling from GPU memory constraints. This design significantly improves Decode-stage throughput, thereby reducing deployment costs in real-world settings. Our high-fidelity simulations show that ESS delivers a 69.4% throughput improvement at 32K context length and up to a 123% throughput improvement at 128K, demonstrating its effectiveness for long-context inference workloads. These results highlight ESS as a practical and scalable solution for long-context LLM serving.
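The memory argument in the abstract can be made concrete with simple arithmetic. The sketch below assumes a per-sequence cache footprint that grows linearly with context length, and models offloading as a cap on how many tokens per sequence stay on the GPU. The function name, the `resident_tokens` parameter, and every number are placeholders chosen for illustration; none of them are figures or interfaces from the paper.

```python
from typing import Optional


def max_batch_size(free_bytes: int, bytes_per_token: int, context_len: int,
                   resident_tokens: Optional[int] = None) -> int:
    """Largest batch whose GPU-resident cache fits in free_bytes.

    Without offload, every token of every sequence stays on the GPU.
    With offload (modeled here as a cap of resident_tokens per sequence),
    the GPU footprint no longer grows with context_len past that cap.
    """
    tokens_on_gpu = context_len if resident_tokens is None else min(context_len, resident_tokens)
    return free_bytes // (bytes_per_token * tokens_on_gpu)


# Placeholder values, not measurements from the paper.
FREE_BYTES = 40 * 1024**3    # hypothetical GPU memory left for the cache
BYTES_PER_TOKEN = 64 * 1024  # hypothetical cache bytes per token, summed over layers

for ctx in (32 * 1024, 128 * 1024):
    no_offload = max_batch_size(FREE_BYTES, BYTES_PER_TOKEN, ctx)
    with_offload = max_batch_size(FREE_BYTES, BYTES_PER_TOKEN, ctx, resident_tokens=4096)
    print(f"context={ctx}: batch without offload={no_offload}, with offload={with_offload}")
```

Under these placeholder numbers the feasible batch without offload shrinks fourfold when the context grows from 32K to 128K, while the capped variant stays constant. This is the sense in which offloading decouples batch size from context length; the actual gains depend on transfer costs that this arithmetic ignores.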

Paper Structure

This paper contains 17 sections, 1 equation, 10 figures, 2 tables.

Figures (10)

  • Figure 2: Intra-Layer Similarity Across Different Context Lengths.
  • Figure 3: Latent-Cache Offload–Prefetch Timing in the PD disaggregation Setting.
  • Figure 5: Intra-Layer Cache Miss Analysis.
  • Figure 6: Comparison of Overlap Strategies.
  • Figure 7: Overhead Comparison of Overlap Strategies.
  • ...and 5 more figures