Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
Jungyeon Koh, Hyun Jong Yang
TL;DR
This work tackles efficient on-device LLM inference in mobile edge computing (MEC) by introducing a unified user association and resource allocation (UARA) framework that enables parallel speculative decoding. It combines a Two-Phase Matching-based Association (TMA) with a Multi-Agent Soft Actor-Critic (MASAC) resource allocator to jointly optimize offloading decisions, bandwidth, and computing resources under time-slotted channel dynamics. Evaluation with the Sionna simulator shows end-to-end latency reductions of up to 28.0% (23.7% on average) without sacrificing inference accuracy, and reveals a tunable energy-delay tradeoff controlled by an energy-weight parameter. By tightly coupling computation, communication, and speculative decoding, the approach advances scalable, energy-aware collaborative LLM services in dense MEC deployments.
Abstract
The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding is a promising approach that partitions token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but it suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method reduces end-to-end latency by up to 28.0%, and by 23.7% on average, without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
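To illustrate the draft-and-verify split that speculative decoding relies on, the following is a minimal sketch rather than the paper's method. It assumes greedy decoding, a sequential (not parallel) draft/verify loop, and two hypothetical toy models, `toy_draft` and `toy_target`, standing in for the on-device draft model and the edge-server target model. Communication and the UARA optimization are not modeled.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model,
                     k: int) -> Tuple[List[int], int]:
    """Run one draft-then-verify round; return (new tokens, accepted draft count)."""
    # Draft side (mobile device in the MEC setting): propose k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Target side (edge server): accept the longest prefix it agrees with.
    new_tokens, ctx = [], list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            new_tokens.append(expected)
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)
    # All k draft tokens accepted: the target contributes one extra token.
    new_tokens.append(target(ctx))
    return new_tokens, k


def generate(prefix: List[int], n: int, k: int,
             draft: Model, target: Model) -> List[int]:
    """Generate n new tokens with repeated speculative steps."""
    seq = list(prefix)
    while len(seq) < len(prefix) + n:
        new_tokens, _ = speculative_step(seq, draft, target, k)
        seq.extend(new_tokens)
    return seq[:len(prefix) + n]


def target_only(prefix: List[int], n: int, target: Model) -> List[int]:
    """Baseline: plain greedy decoding with the target model alone."""
    seq = list(prefix)
    for _ in range(n):
        seq.append(target(seq))
    return seq


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees with it except that it guesses wrong whenever the target emits 5.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 5 else nxt
```

Under greedy verification the speculative output matches what the target model would produce on its own, so `generate([0], 8, 3, toy_draft, toy_target)` equals `target_only([0], 8, toy_target)`. Any speedup comes only from the number of draft tokens accepted per round.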
