Fast Collaborative Inference via Distributed Speculative Decoding

Ce Zheng, Ke Zhang, Chen Sun, Wenqi Zhang, Qiong Liu, Angesom Ataklity Tesfay

TL;DR

This work tackles the uplink bottleneck in collaborative LLM inference over AI-RAN by introducing Truncated Sparse Logits Transmission (TSLT), a sparsify-then-sample scheme that performs Top-$K$ or Top-$p$ truncation on logits and transmits only a sparse set of values to the edge-LLM, while renormalizing to preserve the full probability mass and guarantee lossless inference. It extends the approach to Multi-Candidate Distributed Speculative Decoding (MC-DSD), enabling parallel verification of multiple draft sequences via a token tree, and provides theoretical guarantees on acceptance-rate robustness using total variation distance analysis. Empirical results show that a small transmitted subset (e.g., Top-$K$ around 10% of the vocabulary) retains most probability mass, maintains high acceptance rates, and yields substantial speedups over single-model inference, especially under constrained uplink rates. The proposed framework offers a scalable, communication-efficient pathway for real-time edge-assisted LLM inference in AI-RAN systems, with future work addressing heterogeneous vocabularies and computation-communication tradeoffs.
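To give a rough sense of why transmitting only a truncated set helps the uplink, the following back-of-the-envelope sketch compares per-token payload sizes. Every number in it is an illustrative assumption rather than a value from the paper: a 32,000-token vocabulary, 16-bit logits, fixed-width token indices, and Top-$K$ with $K$ equal to 10% of the vocabulary. The function names are hypothetical.

```python
import math

def full_payload_bits(vocab_size, bits_per_logit):
    """Bits needed to send one logit per vocabulary entry (no indices needed)."""
    return vocab_size * bits_per_logit

def sparse_payload_bits(vocab_size, k, bits_per_logit):
    """Bits needed to send k (index, logit) pairs with fixed-width indices."""
    index_bits = math.ceil(math.log2(vocab_size))
    return k * (bits_per_logit + index_bits)

# Assumed, illustrative settings (not taken from the paper).
vocab_size = 32000
bits_per_logit = 16
k = vocab_size // 10  # Top-K at roughly 10% of the vocabulary

full_bits = full_payload_bits(vocab_size, bits_per_logit)
sparse_bits = sparse_payload_bits(vocab_size, k, bits_per_logit)
ratio = sparse_bits / full_bits

print(f"full: {full_bits} bits, sparse: {sparse_bits} bits, ratio: {ratio:.3f}")
```

Under these assumptions the sparse payload is about 19% of the full one per draft token. The index overhead is why the saving is smaller than the 10% truncation ratio alone would suggest.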

Abstract

Speculative decoding accelerates large language model (LLM) inference by allowing a small draft model to predict multiple future tokens for verification by a larger target model. In AI-native radio access networks (AI-RAN), this enables device-edge collaborative inference but introduces significant uplink overhead, as existing distributed speculative decoding schemes transmit full-vocabulary logits at every step. We propose a sparsify-then-sample strategy, Truncated Sparse Logits Transmission (TSLT), which transmits only the logits and indices of a truncated candidate set. We provide theoretical guarantees showing that the acceptance rate is preserved under TSLT. TSLT is further extended to the multi-candidate case, where multiple draft candidates per step increase the acceptance probability. Experiments show that TSLT significantly reduces uplink communication while maintaining end-to-end inference latency and model quality, demonstrating its effectiveness for scalable, communication-efficient distributed LLM inference in future AI-RAN systems.
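The sparsify-then-sample idea can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes Top-$K$ truncation, a softmax renormalization over the kept logits, and a synthetic logit vector; the helper names (`top_k_truncate`, `sample_from_sparse`) are hypothetical. The point it shows is that the device samples its draft token from the same truncated distribution it transmits, so the edge side can rebuild that distribution exactly from the sparse payload.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_truncate(logits, k):
    """Return (indices, logits) of the k largest logits: the sparse payload."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return idx, [logits[i] for i in idx]

def sample_from_sparse(indices, sparse_logits, rng):
    """Renormalize over the truncated set, then sample a draft token from it."""
    probs = softmax(sparse_logits)
    token = rng.choices(indices, weights=probs, k=1)[0]
    return token, dict(zip(indices, probs))

# Synthetic draft-model logits (assumed data, for illustration only).
rng = random.Random(0)
vocab_size = 1000
logits = [rng.gauss(0.0, 3.0) for _ in range(vocab_size)]

k = 100  # keep roughly 10% of the vocabulary
indices, sparse_logits = top_k_truncate(logits, k)   # sparsify first
token, q_sparse = sample_from_sparse(indices, sparse_logits, rng)  # then sample

# Share of the full softmax mass that the kept tokens carried before renormalizing.
full_probs = softmax(logits)
retained_mass = sum(full_probs[i] for i in indices)
print(f"draft token {token}, retained mass {retained_mass:.3f}")
```

Sampling after truncation, rather than before, means the draft token always lies inside the transmitted set, so the verifier never needs a probability that was not sent.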

Paper Structure

This paper contains 23 sections, 8 theorems, 54 equations, 6 figures, 1 table, 8 algorithms.

Key Result

Lemma 1

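For the $i$-th decoding iteration, the acceptance rate $\alpha_i$ satisfies

$$\alpha_i = 1 - D_{\mathrm{TV}}(Q_i, P_i),$$

where $Q_i$ and $P_i$ denote the SLM and LLM output distributions, respectively, and $D_{\mathrm{TV}}(Q_i, P_i) = \frac{1}{2}\sum_{x} |Q_i(x) - P_i(x)|$ is the total variation distance between the two distributions.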
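The link between acceptance rate and total variation distance can be checked numerically. The sketch below uses the standard speculative-sampling rule from Leviathan et al. (2023): a draft token $x$ drawn from $Q$ is accepted with probability $\min(1, P(x)/Q(x))$, so the overall acceptance probability is $\sum_x \min(P(x), Q(x))$, which equals one minus the total variation distance. The two three-token distributions are made-up toy values, and the function names are hypothetical.

```python
import random

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def acceptance_rate(p, q):
    """Exact acceptance probability: sum over tokens of min(p(x), q(x))."""
    return sum(min(a, b) for a, b in zip(p, q))

def simulate_acceptance(p, q, trials, rng):
    """Monte Carlo estimate: draw x ~ q, accept with probability min(1, p/q)."""
    tokens = list(range(len(q)))
    accepted = 0
    for _ in range(trials):
        x = rng.choices(tokens, weights=q, k=1)[0]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted += 1
    return accepted / trials

# Toy distributions (assumed values, for illustration only).
q = [0.5, 0.3, 0.2]  # draft model (SLM)
p = [0.4, 0.4, 0.2]  # target model (LLM)

exact = acceptance_rate(p, q)
tv = tv_distance(p, q)
estimate = simulate_acceptance(p, q, trials=20000, rng=random.Random(0))
print(f"exact {exact:.3f}, 1 - TV {1 - tv:.3f}, simulated {estimate:.3f}")
```

For these toy values the total variation distance is 0.1 and the exact acceptance rate is 0.9; the simulated estimate should land close to that.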

Figures (6)

  • Figure 1: Single Candidate Distributed Speculative Decoding
  • Figure 2: Multi-Candidate Distributed Speculative Decoding
  • Figure 3: Top-$K$ probability mass under different $K$
  • Figure 4: Acceptance rate under different $K$
  • Figure 5: Speedup Ratio under different uplink transmission rates (SC-DSD).
  • ...and 1 more figure

Theorems & Definitions (23)

  • Definition 2.1: Acceptance Rate (Leviathan et al., 2023)
  • Definition 2.2: Speedup Ratio
  • Lemma 1: Acceptance Rate and TV Distance (Leviathan et al., 2023)
  • Lemma 2
  • proof
  • Theorem 1
  • proof
  • Definition 4.1: Token Tree
  • Definition 4.2: Expansion-configured Token Tree
  • Definition 4.3
  • ...and 13 more