DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Fengze Yu, Leshu Li, Brad McDanel, Sai Qian Zhang
TL;DR
DSD addresses the latency and scalability challenges of serving large language models by extending speculative decoding from single-node setups to distributed edge–cloud environments. It introduces DSD-Sim, a discrete-event simulator that models draft-target coordination, network effects, and batching, and pairs it with an Adaptive Window Control (AWC) policy that uses a WC-DNN to predict the optimal speculation window γ in real time. Across diverse benchmarks, the approach improves both throughput and latency, with up to 9.7% higher throughput than fixed policies. The work also provides a practical framework for evaluating distributed speculative decoding and informs design choices for policy, batching, and routing in heterogeneous environments.
Abstract
Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.
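To make the role of the speculation window concrete, the following minimal sketch shows why the best window size can shift with network delay. It is not the paper's AWC policy or WC-DNN. It assumes the standard simplification from the speculative decoding literature that each drafted token is accepted independently with probability `alpha`, and it uses a hypothetical per-step cost model (`t_draft`, `t_verify`, `rtt`) chosen purely for illustration.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha (a common simplification, not a DSD result).
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def throughput(alpha: float, gamma: int, t_draft: float,
               t_verify: float, rtt: float) -> float:
    """Tokens per second under a hypothetical cost model:
    gamma draft steps, one target verification, one network round trip."""
    step_time = gamma * t_draft + t_verify + rtt
    return expected_tokens(alpha, gamma) / step_time


def best_window(alpha: float, t_draft: float, t_verify: float,
                rtt: float, max_gamma: int = 16) -> int:
    """Exhaustively pick the window size that maximizes modeled throughput."""
    return max(range(1, max_gamma + 1),
               key=lambda g: throughput(alpha, g, t_draft, t_verify, rtt))


if __name__ == "__main__":
    # Illustrative numbers only: 5 ms per draft token, 30 ms per verification.
    for rtt in (0.0, 0.1):
        g = best_window(alpha=0.8, t_draft=0.005, t_verify=0.03, rtt=rtt)
        print(f"rtt={rtt:.3f}s -> best gamma={g}")
```

Under this toy model, a larger round-trip time favors a larger window, because each round trip is amortized over more drafted tokens. A fixed window therefore cannot be optimal across all network conditions, which is the intuition behind adjusting γ at run time.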
