Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference

Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

TL;DR

This work tackles the latency of autoregressive language-model inference by introducing distributed speculative inference (DSI), which uses speculation parallelism (SP) to overlap drafting and verification across multiple target and drafter instances. The method is lossless and provably at least as fast as both speculative inference (SI) and standard autoregressive inference (non-SI), and it scales with hardware via a tunable SP degree ($SP$) and lookahead ($L$). Under the paper's timing assumptions, DSI is at least as fast as, and often strictly faster than, SI and non-SI in expectation. Results on a single node with up to eight GPUs show DSI running about $1.29$–$1.92\times$ faster than SI across several off-the-shelf models and tasks. The work also demonstrates robustness through offline simulations and releases open-source code to support adoption and further exploration of SP-based orchestration for lossless LM inference.

Abstract

This paper introduces distributed speculative inference (DSI), a novel inference algorithm that is provably faster than speculative inference (SI) [leviathan2023, chen2023, miao2024, sun2025, timor2025] and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen language models (LMs), requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but rely on sufficiently fast and accurate drafters, which are often unavailable in practice. We identify a gap where SI can be slower than non-SI if drafters are too slow or inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI, given any drafters. DSI is therefore not only faster than SI, but also unlocks the acceleration of LMs for which SI fails. DSI leverages speculation parallelism (SP), a novel type of task parallelism, to orchestrate target and drafter instances that overlap in time, establishing a new foundational tradeoff between computational resources and latency. Our simulations show that DSI is 1.29-1.92x faster than SI in single-node setups for various off-the-shelf LMs and tasks. We open-source all our code.
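
To make the gap between SI and non-SI concrete, consider a common simplified model. It is an assumption made here for illustration, not a setup stated in this summary: each drafted token is accepted independently with probability $\alpha < 1$, one drafter forward costs a fraction $c$ of one target forward, and SI drafts $k$ tokens per iteration before a single target forward verifies them. The symbols $\alpha$, $c$, and $k$ are introduced only for this sketch. Under that model, the expected run-time ratio is roughly:

```latex
% Expected tokens produced per SI iteration: (1 - \alpha^{k+1}) / (1 - \alpha)
% Cost of one SI iteration, in units of one target forward: k c + 1
\frac{T_{\mathrm{SI}}}{T_{\text{non-SI}}}
  \approx \frac{(1-\alpha)\,(k c + 1)}{1-\alpha^{k+1}}
% Example: \alpha = 0.1, c = 0.5, k = 1 gives 0.9 \cdot 1.5 / 0.99 \approx 1.36
```

A ratio above 1 means SI is slower than non-SI. In the example, an inaccurate and fairly slow drafter makes SI about 36% slower, which is the regime the abstract refers to.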

Paper Structure

This paper contains 27 sections, 6 theorems, 8 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Under Assumptions assumption:sampling_time_is_bounded, assumption:fm_slowest and assumption:sum, Algorithm alg:concurrent_informal returns the same output as running the target model itself without speculative inference (SI), and runs at least as fast.
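
The intuition behind "at least as fast as non-SI" can be illustrated with a toy latency model. This is a sketch under stated assumptions, not the paper's algorithm or its proof: lookahead is 1, each drafted token is accepted independently with probability `accept_prob`, an accepted token costs one drafter forward, a rejected token costs one target forward (its verification was already running in parallel), and the final token waits for one target forward. The function and parameter names are hypothetical.

```python
import random


def non_si_latency(n_tokens, t_target):
    """Non-SI: one blocking target forward per generated token."""
    return n_tokens * t_target


def toy_dsi_latency(n_tokens, t_target, t_drafter, accept_prob, rng):
    """Toy model (assumption): per-token cost is the drafter latency on
    acceptance and the target latency on rejection, plus one final
    target forward to confirm the last token."""
    if n_tokens <= 0:
        return 0.0
    total = 0.0
    for _ in range(n_tokens - 1):
        accepted = rng.random() < accept_prob
        total += t_drafter if accepted else t_target
    total += t_target  # last token must wait for a target forward
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    n, t_target, t_drafter = 50, 1.0, 0.25
    for p in (0.0, 0.5, 0.9):
        dsi = toy_dsi_latency(n, t_target, t_drafter, p, rng)
        print(f"accept_prob={p}: toy DSI={dsi:.2f}, non-SI={non_si_latency(n, t_target):.2f}")
```

In this toy model every term is at most `t_target` whenever the drafter is no slower than the target, so the total never exceeds the non-SI latency, even when every draft is rejected.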

Figures (7)

  • Figure 1: Illustration of the timeline for DSI, SI, and autoregressive inference (non-SI). Blue and yellow mark the forward latency of the target and drafter, respectively. In this example we have $\texttt{lookahead}=1$, namely, the drafter generates a single token in every yellow square. Non-SI and SI are both sequential: each of their iterations ends with a target forward, and this target forward must be completed before the next iteration can start. In DSI, target forwards are not necessarily blocking as in SI and non-SI. While DSI works for any given number of GPUs ($\ge 2$), here it orchestrates eight GPUs.
  • Figure 2: Expected pairwise speedups (or slowdowns) of DSI, SI, and non-SI. Each heatmap is labeled "X/Y" and plots the ratio between the run time of algorithm X and the run time of algorithm Y. The run time of each algorithm is computed by summing the latencies of all the forward passes and intentionally ignoring additional real-world latencies of multithreading systems, such as context switching, which decouples the implementation details from the theoretical analysis. (a): SI is slower than non-speculative inference (non-SI) when the drafter is either slow or inaccurate enough (pink marks slowdowns). (b, c, d): DSI is faster than speculative inference (SI) and non-speculative inference (non-SI) for all configurations with a non-zero acceptance rate, and it is never slower than either SI or non-SI in any configuration. (d): DSI is up to 1.6x faster than the baseline algorithm, where the baseline is the faster of SI and non-SI for each configuration.
  • Figure 3: MBPP Prompt
  • Figure 4: CNN-DM Prompt
  • Figure 5: Alpaca prompt for samples with a non-empty input field.
  • ...and 2 more figures
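
The run-time accounting described for Figure 2 (summing forward-pass latencies and ignoring system overheads) can be sketched for the SI/non-SI ratio as follows. This is a minimal illustration, not the paper's simulation code: it assumes each drafted token is accepted independently with a fixed probability, and the function and parameter names are hypothetical.

```python
import random


def simulate_si_latency(n_tokens, t_target, t_drafter, accept_prob, lookahead, rng):
    """Sum forward-pass latencies of sequential SI until n_tokens are produced."""
    total = 0.0
    produced = 0
    while produced < n_tokens:
        # Each iteration: `lookahead` drafter forwards, then one blocking target forward.
        total += lookahead * t_drafter + t_target
        # Count consecutive accepted drafts; the first rejection stops the run.
        accepted = 0
        while accepted < lookahead and rng.random() < accept_prob:
            accepted += 1
        # The target forward always contributes one additional token.
        produced += accepted + 1
    return total


def si_over_non_si(n_tokens, t_target, t_drafter, accept_prob, lookahead, rng):
    """Ratio 'SI/non-SI' in the sense of the heatmap labels; > 1 is a slowdown."""
    si = simulate_si_latency(n_tokens, t_target, t_drafter, accept_prob, lookahead, rng)
    return si / (n_tokens * t_target)


if __name__ == "__main__":
    rng = random.Random(0)
    for accept_prob in (0.1, 0.5, 0.9):
        for drafter_cost in (0.05, 0.5):
            ratio = si_over_non_si(1000, 1.0, drafter_cost, accept_prob, 5, rng)
            print(f"accept={accept_prob}, drafter/target={drafter_cost}: SI/non-SI={ratio:.2f}")
```

Sweeping `accept_prob` and the drafter-to-target latency ratio over a grid gives the kind of data a pairwise heatmap would plot; slow or inaccurate drafters push the ratio above 1.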

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Theorem 2
  • proof
  • Proposition 1
  • proof
  • Theorem 2
  • proof