ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong

TL;DR

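ATTS (Asynchronous Test-Time Scaling) is a statistically guaranteed adaptive scaling framework that follows the hypothesis testing process to address the memory-bound execution and synchronization overhead that arise when test-time scaling of large language models is accelerated with speculative decoding. By revisiting arithmetic intensity, it identifies synchronization as the primary bottleneck.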

Abstract

Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. Speculative decoding is a natural way to accelerate the scaling process; however, scaling along both the parallel and sequential dimensions poses significant challenges, including substantial memory-bound execution and synchronization overhead. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework that follows the hypothesis testing process to address these challenges. By revisiting arithmetic intensity, ATTS identifies synchronization as the primary bottleneck. It enables asynchronous inference through online calibration and proposes an ordinal classification algorithm that supports a three-stage rejection sampling pipeline, scaling along both the sequential and parallel axes. Across experiments on the MATH, AMC23, AIME24, and AIME25 datasets and across multiple draft-target model families, we show that ATTS delivers up to 56.7x speedup in test-time scaling and a 4.14x throughput improvement, while maintaining accurate control of the rejection rate, reducing latency and memory overhead, and incurring no accuracy loss. By scaling both in parallel and sequential dimensions, we enable the 1.5B/70B draft/target model combination to achieve the performance of the state-of-the-art reasoning model o3-mini (high) on the AIME dataset. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.
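
The rejection-rate control mentioned in the abstract rests on conformal p-values. The following is a minimal sketch of a generic marginal conformal p-value with a level-alpha rejection rule, not the paper's ordinal classification algorithm or its three-stage pipeline. It assumes a scalar nonconformity score where larger means less conforming, plus a held-out calibration set; all function names and the uniform toy scores are hypothetical and chosen only for illustration.

```python
import random


def conformal_p_value(calibration_scores, test_score):
    """Marginal conformal p-value of test_score against a calibration set.

    Assumption: larger score = less conforming. The p-value is the
    (smoothed) fraction of calibration scores at least as large as the
    test score.
    """
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least) / (n + 1)


def reject(calibration_scores, test_score, alpha):
    """Reject the candidate when its conformal p-value is at most alpha."""
    return conformal_p_value(calibration_scores, test_score) <= alpha


def empirical_rejection_rate(n_cal, n_trials, alpha, seed=0):
    """Estimate how often an exchangeable test point is rejected.

    Calibration and test scores are drawn i.i.d. from the same toy
    distribution, so they are exchangeable and the rate should not
    exceed alpha by more than sampling noise.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        cal = [rng.random() for _ in range(n_cal)]
        test = rng.random()
        if reject(cal, test, alpha):
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    rate = empirical_rejection_rate(n_cal=99, n_trials=20000, alpha=0.1)
    print(f"empirical rejection rate at alpha=0.1: {rate:.3f}")
```

Under exchangeability, the probability that such a p-value falls at or below alpha is at most alpha, which is the sense in which a rejection rate can be controlled without distributional assumptions on the scores.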

Paper Structure

This paper contains 44 sections, 6 theorems, 48 equations, 11 figures, 3 tables.

Key Result

Proposition 1

Suppose that $\{(X_i, Y_i)\}_{i=1}^{n}$ are exchangeable random variables from the test dataset, and $\xi \sim \text{Uniform}\{1, 2, \ldots, n\}$ represents randomly sampling one data point from the test dataset, where $k$ denotes the $k$-th sample of that data point, then the marginal conformal $p$ is valid in the sense that for the miscoverage rate $\alpha \in (0,1)$, we have Moreover, if the c

Figures (11)

  • Figure 1: Memory Overhead vs. Sampling Sizes (QwQ 32B, Token Budget 500)
  • Figure 2: Comparison of naive and asynchronous speculative decoding.
  • Figure 3: Execution cost comparison between synchronous and asynchronous test-time scaling.
  • Figure 4: Analysis of arithmetic intensity.
  • Figure 5: Asynchronous test-time scaling pipeline. The green box illustrates parallel scaling and follows the rejection sampling procedure, while the blue box illustrates sequential scaling.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Proposition 1
  • Proof of Proposition 1
  • Proposition 2
  • proof
  • Proposition 3
  • Proof of Proposition 3
  • Theorem 1
  • proof
  • Proposition 4
  • proof
  • ...and 2 more