DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Fengze Yu, Leshu Li, Brad McDanel, Sai Qian Zhang

TL;DR

DSD addresses the latency and scalability challenges of serving large language models by extending speculative decoding from single-node setups to distributed edge–cloud environments. It introduces DSD-Sim, a discrete-event simulator that models draft-target coordination, network effects, and batching, and couples it with an adaptive window control (AWC) policy that uses a WC-DNN to predict the optimal speculation window γ in real time. The approach yields tangible improvements in throughput and latency across diverse benchmarks, demonstrating up to 9.7% higher throughput and more efficient decoding compared to fixed policies, with broader implications for scalable, low-latency LLM serving. The work also provides a practical framework for evaluating distributed speculative decoding and informs design choices for policy, batching, and routing in heterogeneous environments.

Abstract

Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.
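
To see why the speculation window size matters once the draft and target models sit on different machines, a toy calculation can help. The sketch below is not DSD-Sim and not the AWC policy. It assumes each drafted token is accepted independently with probability `alpha` (the standard simplification from the speculative decoding literature), and it assumes one round costs `gamma` draft steps, one verification pass, and one network round trip. The function names and all latency values are hypothetical.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft-verify round.

    Assumes i.i.d. acceptance with probability alpha; the count includes
    the one token the target model always contributes.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def tokens_per_second(alpha: float, gamma: int, t_draft: float,
                      t_verify: float, rtt: float) -> float:
    """Throughput under an assumed round cost (all times in seconds)."""
    round_time = gamma * t_draft + t_verify + rtt
    return expected_tokens_per_round(alpha, gamma) / round_time


def best_window(alpha: float, t_draft: float, t_verify: float,
                rtt: float, max_gamma: int = 16) -> int:
    """Brute-force search for the window that maximizes modeled throughput."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: tokens_per_second(alpha, g, t_draft, t_verify, rtt),
    )


if __name__ == "__main__":
    # Hypothetical numbers: 5 ms per draft token, 30 ms per verification pass.
    for rtt in (0.0, 0.02, 0.1):
        g = best_window(alpha=0.8, t_draft=0.005, t_verify=0.03, rtt=rtt)
        print(f"rtt={rtt * 1000:.0f} ms -> best gamma under this model: {g}")
```

Under this simplified model, a larger fixed cost per round (for example a longer round trip) never favors a smaller window, which gives some intuition for why a static γ can be a poor fit when network conditions vary.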

Paper Structure

This paper contains 36 sections, 2 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: (a) Distributed edge–cloud environment for speculative decoding. (b) Joint processing between edge and cloud servers during SD operation. (c) Illustration of the speculative decoding workflow.
  • Figure 2: The configuration parser ingests YAML configuration files and workload traces, routing requests through the DSD scheduler. The scheduler then coordinates the hardware modeling engine to generate detailed system performance outputs, which are subsequently processed by the performance analyzer for SLO evaluation.
  • Figure 3: The WC-DNN architecture takes five input features and generates the optimal prediction for the speculation window size. The edge LLM then adjusts its window size based on this prediction and coordinates with the cloud LLM for verification.
  • Figure 4: GPU-level calibration of predicted vs. actual inference latencies for prefill and decode across Qwen-7B, Qwen-72B, Llama-2-7B, and Llama-2-70B on A40, A100, and H100 GPUs. Error bars indicate standard deviation over 100 requests.
  • Figure 5: End-to-end SLOs and throughput for policy stacks. Default: Random routing + FIFO queueing + Static $\gamma$. Setting 1: JSQ + FIFO + Static $\gamma$. Setting 2: JSQ + Length-Aware Batching (LAB) + Static $\gamma$. Setting 3: JSQ + Length-Aware Batching + Dynamic $\gamma$. Setting 4: JSQ + Length-Aware Batching + AWC.
  • ...and 5 more figures