AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding

Shuqing Luo, Yilin Guan, Pingzhi Li, Hanrui Wang, Tianlong Chen

TL;DR

AsyncSpade tackles the decoding bottleneck in test-time scaling (TTS) for long chain-of-thought reasoning by decoupling KV-cache management from autoregressive inference through a two-rank, asynchronous architecture. It uses a lightweight temporal-regressive next-query predictor, grounded in the temporal locality of adjacent queries and the linear correlations between them, to enable token-level KV selection without introducing sequential dependencies. The framework supports multiple attention architectures via batched GEMMs and reduces time-per-output-token (TPOT) while preserving accuracy on standard TTS benchmarks. Empirically, AsyncSpade delivers over 20% TPOT reduction versus Quest and at least 50% versus full attention on Qwen3-8B and Qwen3-32B.

Abstract

Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but the linear growth of the KV cache amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but it is limited by both sequentially dependent page filtering and coarse-grained token selection. These limitations hamper serving efficiency and model performance on TTS tasks under high concurrency and long CoT, where the filtering can consume even more runtime than the forward pipeline itself. In this paper, we first find that the current-step query state can be accurately approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel lightweight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples KV-cache filtering from the autoregressive decoding loop, overlapping token-level KV selection with the forward inference computation. To our knowledge, AsyncSpade is the first to eliminate the sequential dependence without sacrificing model performance. We validate the effectiveness of AsyncSpade on common LLM serving setups with an A100 node, where AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving the theoretically optimal time-per-output-token (TPOT). Specifically, AsyncSpade delivers over 20% TPOT reduction compared to the state-of-the-art baseline (i.e., Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).
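
To make the first component concrete, here is a minimal NumPy sketch of predicting the next query state from a short window of recent queries. It follows the single-token-shifting regression described in the Figure 5 caption, as one possible reading rather than the authors' implementation: the function name `predict_next_query`, the per-head `(T, d)` layout, and the use of plain least squares are all assumptions made for illustration.

```python
import numpy as np

def predict_next_query(queries: np.ndarray, window: int) -> np.ndarray:
    """Estimate q_T from past query states q_0 .. q_{T-1} of one attention head.

    queries: array of shape (T, d). Hypothetical helper, not the paper's code.
    """
    T, _ = queries.shape
    if T < window + 1:
        raise ValueError("need at least window + 1 past queries")

    # Fit step: express q_{T-1} as a linear combination of q_{T-1-W} .. q_{T-2}.
    basis = queries[T - 1 - window : T - 1].T   # shape (d, W)
    target = queries[T - 1]                     # shape (d,)
    weights, *_ = np.linalg.lstsq(basis, target, rcond=None)

    # Shift step: reuse the same weights on q_{T-W} .. q_{T-1} to estimate q_T.
    shifted = queries[T - window : T].T         # shape (d, W)
    return shifted @ weights
```

Because the fit only uses queries that are already available, the estimate of `q_T` can be computed before step `T` runs, which is what lets KV selection proceed without waiting on the decoding loop.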

Paper Structure

This paper contains 27 sections, 7 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Performance of Qwen3-32B on AIME24 with Long Decoding. AsyncSpade minimizes the decoding FLOPs while maintaining high performance.
  • Figure 2: Runtime Profiling for Page-level Sparse Decoding. We benchmark the latency breakdown of a single Transformer block in the decoding stage on Qwen3 dense models [yang2025qwen3] on an NVIDIA A100 GPU, using the configurations summarized in Figure 3. We set the page size to $16$, following the default setting of FlashInfer [yeflashinfer], and select $1/16$ of the tokens from the full KV cache. (a) reports results for varied batch sizes ($1$–$512$) to emulate different serving concurrency, while (b) reports results for varied context lengths ($4k$–$512k$) to emulate long chain-of-thought decoding.
  • Figure 3: Configurations of the profiled Qwen3 dense models.
  • Figure 4: Overlap ratio for query states across different token distances. The overlap ratio for distance $d$ and token $t$ is measured by ${\mathcal{O}}_{t-d, t}$. We use Qwen3-32B and AIME24 with full attention for the profiling experiments, where the overlap ratio is averaged over the sample and attention head dimensions, and $4$ layers are examined. $1/8$ of the tokens in the KV cache are selected.
  • Figure 5: Overlap ratio for linear regression w/ & w/o single-token shifting. We follow the same settings as Figure 4. Given window size $W$, the overlap ratio of token $t$ without single-token shifting is measured by first regressing token $t$ on tokens $\{t-W,\dots,t-1\}$ and then applying the solved weights to the same tokens $\{t-W,\dots,t-1\}$. The overlap ratio of token $t$ with single-token shifting is measured by first regressing token $t-1$ on tokens $\{t-W-1,\dots,t-2\}$ and then applying the solved weights to tokens $\{t-W,\dots,t-1\}$. $W=16$ is used for profiling.
  • ...and 5 more figures
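
The overlap ratio used in the Figure 4 and Figure 5 captions can be sketched as follows. The captions do not spell out the formula, so this assumes it is the fraction of top-$k$ KV indices shared between the selection made by a reference query and the selection made by an approximating query, with scores given by plain dot products. The helper names `topk_indices` and `overlap_ratio` and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np

def topk_indices(query: np.ndarray, keys: np.ndarray, k: int) -> set:
    """Indices of the k keys with the largest dot-product score against query."""
    scores = keys @ query                        # shape (n_tokens,)
    return set(np.argpartition(scores, -k)[-k:].tolist())

def overlap_ratio(q_ref: np.ndarray, q_approx: np.ndarray,
                  keys: np.ndarray, keep_fraction: float = 1 / 8) -> float:
    """Assumed definition: shared top-k indices divided by k."""
    k = max(1, int(keys.shape[0] * keep_fraction))
    ref = topk_indices(q_ref, keys, k)
    approx = topk_indices(q_approx, keys, k)
    return len(ref & approx) / k
```

Under this reading, Figure 4 would correspond to calling `overlap_ratio(q_t, q_{t-d}, keys)` for several distances `d`, and Figure 5 to passing a regressed estimate of `q_t` as `q_approx`.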