Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan

Abstract

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.

Paper Structure (27 sections, 30 equations, 4 figures, 7 tables, 2 algorithms)

Figures (4)

  • Figure 1: The SFI Framework. (A) Motivation: Attention maps from Qwen3-0.6B illustrate a common pattern of within-sentence support stability: across consecutive decoding steps within a sentence, and more generally within a short semantically coherent span, the dominant attended positions remain largely stable rather than changing abruptly at every step. (B) Slow-Fast paradigm and speedup: SFI decouples decoding into many low-cost Fast Steps and occasional dense Slow Steps. The speedup plot reports the average end-to-end throughput gain across the Qwen series, from 0.6B to 235B, and shows that the advantage of this slow-fast schedule grows with context length.
  • Figure 2: The Slow-Fast Inference framework. Top: Across multiple model scales, attention mass often remains concentrated on a largely stable set of positions within a semantic unit, illustrating within-sentence support stability. Bottom: SFI exploits this pattern by alternating frequent low-cost fast steps, which attend to a managed sparse state (sink + selected + recent), with occasional slow steps. A slow step is triggered when a boundary token is generated (Eq. \ref{eq:trigger}) or when a fixed refresh interval is reached; it then performs dense attention, collects masked attention logits over the allowed candidate set, and invokes the Selector to refresh the selected memory for the next segment.
  • Figure 3: System infrastructure for efficient Slow-Fast Inference. (A) Asynchronous pipeline: SFI overlaps the main attention computation with slow-step maintenance across two execution streams. While the Main Stream computes attention for layer $i+1$, the Aux Stream concurrently runs the Selector and cache reorganization for layer $i$, so that most maintenance overhead is hidden behind ongoing layer execution. (B) Memory-coalesced sparse kernel: Native sparse attention suffers from scattered KV reads and poor bandwidth utilization. We therefore use a two-segment layout in which sink and selected tokens are packed into a contiguous compact buffer, enabling high-bandwidth sequential access over most of the sparse context, while recent tokens are read in place from paged KV.
  • Figure 4: End-to-end decoding throughput across model scales and context lengths. Throughput (tok/s) is measured for the full-KV baseline (full KV cache) and SFI with up to 2048 generated tokens per request. SFI consistently improves decoding throughput, and the advantage grows with context length.
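
The slow-fast schedule described in the Figure 2 caption can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation: the boundary-token set, the top-k rule standing in for the Selector, and all function and parameter names (`is_slow_step`, `select_support`, `fast_positions`, `schedule`, `refresh_interval`, `budget`) are assumptions made purely for illustration. The actual trigger is defined by the paper's Eq. \ref{eq:trigger}, and the real Selector may differ from a plain top-k.

```python
# Toy sketch of a slow-fast decoding schedule (illustrative assumptions only).

# Assumed boundary tokens; the paper's actual trigger rule is not reproduced here.
BOUNDARY_TOKENS = {".", "?", "!", "\n"}


def is_slow_step(token, steps_since_refresh, refresh_interval):
    """Slow step on a boundary token or once the refresh interval is reached."""
    return token in BOUNDARY_TOKENS or steps_since_refresh >= refresh_interval


def select_support(logits, num_sink, num_recent, budget):
    """Stand-in Selector: keep the `budget` highest-logit positions.

    Sink and recent positions are excluded from the candidate set because
    fast steps always attend to them anyway.
    """
    n = len(logits)
    lo, hi = num_sink, max(num_sink, n - num_recent)
    top = sorted(range(lo, hi), key=lambda i: logits[i], reverse=True)[:budget]
    return sorted(top)


def fast_positions(n, num_sink, num_recent, selected):
    """Positions a fast step attends to: sink + selected + recent."""
    sink = set(range(min(num_sink, n)))
    recent = set(range(max(0, n - num_recent), n))
    return sorted(sink | set(selected) | recent)


def schedule(tokens, refresh_interval):
    """Label each generated token as a 'slow' or 'fast' step."""
    labels, since = [], 0
    for tok in tokens:
        since += 1
        if is_slow_step(tok, since, refresh_interval):
            labels.append("slow")
            since = 0  # the selected memory was just refreshed
        else:
            labels.append("fast")
    return labels


if __name__ == "__main__":
    toks = ["The", "cat", "sat", ".", "It", "ran", "far", "away", "."]
    print(schedule(toks, refresh_interval=4))

    # Pretend these are the masked attention logits collected at a slow step.
    logits = [9.0, 0.1, 0.5, 3.0, 0.2, 2.0, 0.3, 8.0]
    chosen = select_support(logits, num_sink=1, num_recent=2, budget=2)
    print(fast_positions(len(logits), 1, 2, chosen))
```

In this toy run, a fast step over an 8-token history attends to 5 positions rather than 8. With a fixed sink, budget, and recent window, the fast-step cost stays constant while the history grows, which is consistent with the caption's observation that the advantage increases with context length.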