Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding

Payel Bhattacharjee, Fengwei Tian, Meiyu Zhong, Guangyi Zhang, Osvaldo Simeone, Ravi Tandon

TL;DR

Edge-cloud LLM inference is hampered by limited uplink bandwidth, which motivates distribution-aware compression of draft-token probabilities. The authors propose Sparse Quantize-and-Sample (SQS) speculative decoding, which combines sparsification with lattice-based quantization to compress draft distributions while preserving the SD guarantee. They derive an information-theoretic bound on token rejection that decomposes into a mismatch term between $q_n^t$ and $p_n^t$ and a sparsification/quantization distortion term, and they propose two implementations: K-SQS, with fixed top-$K$ truncation, and C-SQS, with online conformal adaptation. The algorithms come with theoretical guarantees, including a bound on sparsification distortion and a conformal-threshold update rule. Experiments on LM1B under bandwidth constraints show reductions in latency and bandwidth, with K-SQS favored in low-uncertainty regimes and C-SQS in higher-uncertainty settings.

Abstract

Edge-cloud speculative decoding (SD) accelerates inference by having a cloud-based large language model (LLM) verify draft tokens generated by a resource-constrained small language model (SLM) at the edge. A central bottleneck is the limited bandwidth of the edge-cloud link, which necessitates efficient compression of draft token distributions. We first derive an information-theoretic bound that decomposes the token rejection rate into contributions from SLM-LLM distribution mismatch and from quantization distortion. Guided by this analysis, we propose the Sparse Quantize-and-Sample SD (SQS-SD) framework, which exploits distributional sparsity through structured sparsification and lattice-based quantization. Within this framework, K-SQS applies fixed top-K truncation, while C-SQS adaptively adjusts the retained token set via online conformal prediction to ensure bounded deviation from the dense distribution. Empirical results confirm that both approaches improve end-to-end latency and rejection rates in complementary operating regimes.
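
As a rough illustration of the sparsify-then-quantize idea described above, the following minimal sketch keeps the top-K entries of a draft distribution, renormalizes them, and rounds the result onto a lattice of probabilities that are integer multiples of $1/\ell$. The function names, the largest-remainder rounding rule, and the parameter `ell` are assumptions made for this example; the paper's actual quantizer and sampling step may differ.

```python
def top_k_sparsify(q, k):
    """Keep the k largest entries of q and renormalize.

    Returns a dict {token_id: probability}. Hypothetical helper, not the
    authors' implementation.
    """
    if not 1 <= k <= len(q):
        raise ValueError("k must be between 1 and len(q)")
    kept = sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k]
    mass = sum(q[i] for i in kept)
    return {i: q[i] / mass for i in kept}


def lattice_quantize(probs, ell):
    """Round probs to integer counts c_i with sum(c_i) == ell.

    The quantized probability of token i is c_i / ell. Largest-remainder
    rounding is an assumption chosen for simplicity.
    """
    scaled = {i: p * ell for i, p in probs.items()}
    counts = {i: int(s) for i, s in scaled.items()}  # floor, since s >= 0
    leftover = ell - sum(counts.values())
    # Hand the remaining units to the entries with the largest fractional parts.
    by_remainder = sorted(scaled, key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    q = [0.5, 0.2, 0.15, 0.1, 0.05]   # toy draft distribution over 5 tokens
    sparse = top_k_sparsify(q, k=3)    # keep tokens 0, 1, 2
    counts = lattice_quantize(sparse, ell=8)
    print(counts)                      # integer counts that sum to ell
```

Under this scheme the edge would only need to send the retained token indices and their integer counts rather than a dense floating-point vector, which is the source of the bandwidth saving in this toy setting.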

Paper Structure

This paper contains 14 sections, 6 theorems, 31 equations, 6 figures, 2 algorithms.

Key Result

Theorem 1

Consider a sequence of $T$ tokens $\{X_t\}^T_{t=1}$ generated by using an SQS protocol, with corresponding per-token subsets $\mathcal{X}_n$ of cardinality $K_n(\mathcal{X}_n)$ and resolution parameters $\ell_n$ for the tokens $n=1,...,T$. The expected number of rejected tokens can be upper bounded where the expectation $\mathbb{E}_{\{X_t\}^{n-1}_{t=1}\sim p}[\cdot]$ is with respect to tokens gen

Figures (6)

  • Figure 1: Illustration of the sparse quantize-and-sample (SQS) framework for edge-cloud speculative decoding for efficient LLM inference. The edge device adaptively sparsifies and quantizes the SLM’s next-token distribution using a threshold that is updated by a principled rule based on online conformal prediction.
  • Figure 2: Latency (average total time in seconds) and resampling rate for $K$-SQS and C-SQS across different temperatures $(T)$. $K$-SQS shows increasing latency and higher variability in resampling rate as $T$ increases, while C-SQS maintains more stable performance, achieving a better trade-off between latency and resampling efficiency in higher-uncertainty regimes.
  • Figure 3: Illustration of the definition of rejected and resampled tokens, $N_{\textrm{rej}}$, and of the total number of rejected tokens.
  • Figure 4: Latency for $K$-SQS and C-SQS methods versus $K$ and $\beta$, respectively, across varying temperature settings.
  • Figure 5: Latency and resampling rate as a function of temperature for C-SQS with and without adaptivity.
  • ...and 1 more figure
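
The adaptive threshold mentioned in the Figure 1 caption can be pictured with a generic online-conformal-style update. The sketch below is not the paper's rule: it assumes a probability threshold `beta`, a target dropped-mass level `alpha`, and a fixed step size `eta`, all of which are names and choices introduced here for illustration. Tokens with probability below `beta` are dropped; if the dropped mass exceeds `alpha`, the threshold is lowered so that more tokens are retained at the next step, and otherwise it is raised.

```python
def threshold_sparsify(q, beta):
    """Keep tokens with probability >= beta (always at least the argmax).

    Returns (kept_indices, dropped_mass). Hypothetical helper.
    """
    kept = [i for i, p in enumerate(q) if p >= beta]
    if not kept:
        kept = [max(range(len(q)), key=lambda i: q[i])]
    dropped = 1.0 - sum(q[i] for i in kept)
    return kept, max(dropped, 0.0)


def update_threshold(beta, dropped, alpha, eta):
    """One generic online update step, clipped to [0, 1].

    beta decreases when the dropped mass overshoots the target alpha and
    increases when it undershoots. This form is an assumption.
    """
    return min(max(beta - eta * (dropped - alpha), 0.0), 1.0)


if __name__ == "__main__":
    beta, alpha, eta = 0.12, 0.05, 0.1   # illustrative values only
    stream = [
        [0.5, 0.2, 0.15, 0.1, 0.05],     # flatter distribution: more mass dropped
        [0.97, 0.01, 0.01, 0.01],        # peaked distribution: little mass dropped
    ]
    for q in stream:
        kept, dropped = threshold_sparsify(q, beta)
        beta = update_threshold(beta, dropped, alpha, eta)
        print(kept, round(dropped, 4), round(beta, 4))
```

In this toy run the retained set shrinks for peaked distributions and grows for flat ones, which is the qualitative behavior the captions attribute to C-SQS in higher-uncertainty regimes.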

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2
  • Lemma 3: Step-size envelope
  • Lemma 4: Universal bound on $\beta$
  • proof