SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Häggström, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Håkan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

TL;DR

SnapStream tackles memory bottlenecks in production LLM inference by integrating KV-cache compression with continuous batching in static-graph deployments. It fuses SnapKV-based prefill compression with StreamingLLM-driven decoding using a ring-buffer KV cache to support long-context sequences within fixed memory footprints. The approach is realized through a static-graph mapping on SN40L hardware, achieving ~4x KV-cache memory reduction and up to 4.3x decoding throughput with modest prefill latency overhead, while preserving accuracy on large-scale long-context and reasoning benchmarks. This work demonstrates the practicality of training-free KV-cache compression in production inference and informs hardware-aware design for scalable long-context LLM serving.

Abstract

The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet these techniques are not commonly used in industrial deployments built on frameworks like vLLM or SGLang. The reason is twofold. First, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm. Second, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, which obscures the case for implementing them. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.

Paper Structure

This paper contains 26 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) Common KV cache compression methods like SnapKV (Li et al., 2024) perform compression when the input sequence reaches length $L_\text{threshold}$. (b) Continuous batching deployments consist of two graphs: a prefill graph that produces a single new token and the KV cache, and a decode graph that generates the next token and an updated KV cache. It is unclear where KV cache compression can be performed in this process, as the threshold $L_\text{threshold}$ can be reached during either prefill or decode and at different times for different batch elements.
  • Figure 2: SN40L Architecture. Packaged as a two-die socket in the TSMC 5FF process. Each die features 2 dense compute Tiles, 2 HBM modules, and 3 DDR channels. Tiles are interconnected via the Top Level Network (TLN) and can communicate with other RDUs using the P2P interfaces. Each Tile comprises PCUs and PMUs connected in a mesh network (the RDN), enabling seamless data exchange.
  • Figure 3: SnapStream applies SnapKV during prefill (b) to produce a compressed KV cache and StreamingLLM during decoding (d) to update the recent tokens of the compressed cache in-place. In contrast, standard static graph prefill (a) produces a padded KV cache that is appended to during decoding (c).
  • Figure 4: An example of how the SnapStream ring buffer is constructed during prefill, and how it is updated during decoding. See the prefill compression listing in the Appendix for prefill pseudocode. Given an input sequence with $L=26$, $L_{\text{sink}} = 1$, $L_{\text{recent}} = 4$, we gather KVs from indices 21-24 as Range 1 and 25-28 as Range 2. The ring buffer is constructed with indices 0-1 from Range 2 and indices 2-3 from Range 1. During decoding, we replace the KV for token index 23 with the KV for the newly generated index 27.
  • Figure 5: High-level block diagram of the modified MoE prefill graph incorporating SnapStream compression. The graph is decomposed into multiple fused kernels, indicated by green boxes.
  • ...and 2 more figures
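
The ring-buffer example in the Figure 4 caption can be traced with a short Python sketch. This is not the authors' listing. It assumes 1-indexed token positions and a slot rule of `(position - 1) mod L_recent`, both inferred from the numbers in the caption. The function names `build_ring_buffer` and `decode_step` are hypothetical. The sketch tracks only which token position occupies each slot, and it omits the sink tokens and the SnapKV-selected tokens.

```python
from typing import List, Optional


def build_ring_buffer(seq_len: int, l_recent: int) -> List[Optional[int]]:
    """Return the token position held by each ring-buffer slot after prefill.

    Assumption (inferred from the Figure 4 caption): positions are 1-indexed
    and position p lives in slot (p - 1) % l_recent.
    """
    slots: List[Optional[int]] = [None] * l_recent
    first_recent = max(1, seq_len - l_recent + 1)
    for pos in range(first_recent, seq_len + 1):
        slots[(pos - 1) % l_recent] = pos
    return slots


def decode_step(slots: List[Optional[int]], new_pos: int) -> Optional[int]:
    """Overwrite the oldest recent entry in place; return the evicted position."""
    slot = (new_pos - 1) % len(slots)
    evicted = slots[slot]
    slots[slot] = new_pos
    return evicted


if __name__ == "__main__":
    # Values from the caption: L = 26, L_recent = 4.
    ring = build_ring_buffer(seq_len=26, l_recent=4)
    print(ring)                    # slots 0-1 from Range 2, slots 2-3 from Range 1
    print(decode_step(ring, 27))   # position evicted by the new token
    print(ring)
```

With $L=26$ and $L_{\text{recent}}=4$, the buffer after prefill holds positions 25 and 26 in slots 0-1 and positions 23 and 24 in slots 2-3. The first decode step writes position 27 into slot 2 and evicts position 23, matching the caption. Because the slot index depends only on the position, the buffer shape never changes, which is the property that keeps the decode graph static.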