Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

Jungyeon Koh, Hyun Jong Yang

TL;DR

This work tackles efficient on-device LLM inference in mobile edge computing (MEC) by introducing a unified user association and resource allocation (UARA) framework that enables parallel speculative decoding. It combines a Two-Phase Matching-based Association (TMA) with a Multi-Agent Soft Actor-Critic (MASAC) resource allocator to jointly optimize offloading decisions, bandwidth, and computing resources under time-slotted channel dynamics. Evaluation with the Sionna simulator shows end-to-end latency reductions of up to 28.0% (23.7% on average) without sacrificing inference accuracy, and reveals a tunable energy-delay tradeoff controlled by an energy-weight parameter. The approach supports scalable, energy-aware collaborative LLM services in dense MEC deployments by tightly coupling computation, communication, and speculative decoding.

Abstract

The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
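The draft-then-verify split described in the abstract can be illustrated with a minimal sketch of one round of greedy speculative decoding. This is not the authors' implementation: the names `speculative_round`, `toy_target`, and `toy_draft` are hypothetical, the "models" are toy functions over integer tokens, and the verifier is written as a loop although a real target model would check all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_round(prefix: List[int], draft: Model, target: Model,
                      draft_len: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens that are kept."""
    # Draft phase (mobile device): propose draft_len tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(draft_len):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Verify phase (edge server): accept drafted tokens until the first mismatch.
    kept: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            kept.append(expected)  # replace the rejected token with the target's choice
            return kept
        kept.append(tok)
        ctx.append(tok)

    # Every drafted token was accepted, so the target contributes one extra token.
    kept.append(target(ctx))
    return kept


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 2."""
    return 7 if ctx[-1] == 2 else (ctx[-1] + 1) % 10
```

Because every kept token is either confirmed or produced by the target model, the output of this greedy variant matches what the target would generate on its own, which is the sense in which speculative decoding avoids an accuracy loss.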

Paper Structure

This paper contains 18 sections, 10 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of (a) conventional and (b) parallel speculative decoding. (Top) Example of token generation. In (a), the draft model waits for server-side verification and the target model waits for draft generation. In contrast, (b) allows continuous generation for both models, resulting in faster task completion. (Bottom) End-to-end timeline illustrating the interplay between mobile computation, network transmission and server computation. In (a), the sequential draft-then-verify process introduces prolonged idle time. In contrast, (b) reduces idle periods by enabling concurrent execution. The shaded regions highlight the mutual waiting problem unique to the conventional approach and the communication overhead present in both methods.
  • Figure 2: Latency of (a) conventional and (b) parallel speculative decoding on the HumanEval dataset. (Left) Latency is broken down into mobile computation, network transmission and server computation for varying draft lengths. (Right) Total latency is measured across different data rates.
  • Figure 3: Overview of the proposed MEC system. Key UARA decision variables are indicated in red.
  • Figure 4: Illustration of the proposed TMA-MASAC.
  • Figure 5: Average latency under different numbers of (left) mobile devices and (right) edge servers.
  • ...and 1 more figure
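The timeline contrast in the Figure 1 caption can be approximated with a simple two-stage pipeline model. The sketch below is an illustrative assumption rather than the paper's latency model: the function and parameter names are hypothetical, every round is assumed to have identical stage times, and rollbacks caused by rejected drafts are ignored.

```python
def conventional_latency(rounds: int, draft_len: int, t_draft_tok: float,
                         t_up: float, t_verify: float, t_down: float) -> float:
    """Sequential draft-then-verify: each round pays for every stage in turn."""
    per_round = draft_len * t_draft_tok + t_up + t_verify + t_down
    return rounds * per_round


def parallel_latency(rounds: int, draft_len: int, t_draft_tok: float,
                     t_up: float, t_verify: float, t_down: float) -> float:
    """Overlapped drafting and verification, modeled as a two-stage pipeline."""
    draft = draft_len * t_draft_tok    # mobile-side stage
    remote = t_up + t_verify + t_down  # network plus server-side stage
    # The first round cannot overlap with anything; after that, the slower
    # stage sets the pace of each additional round.
    return draft + remote + (rounds - 1) * max(draft, remote)


if __name__ == "__main__":
    # Hypothetical stage times in seconds, chosen only for illustration.
    args = dict(rounds=10, draft_len=4, t_draft_tok=0.01,
                t_up=0.02, t_verify=0.03, t_down=0.01)
    print("conventional:", conventional_latency(**args))
    print("parallel:    ", parallel_latency(**args))
```

Under this model the parallel schedule still pays the communication cost in full, consistent with the caption's remark that communication overhead is present in both methods; the saving comes only from hiding the shorter stage behind the longer one.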