Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices
Alejandro Ruiz y Mesa, Guilherme Korol, Moritz Riesterer, João Paulo Cardoso de Lima, Jeronimo Castrillon
TL;DR
LLMs suffer high latency on resource-constrained edge devices; the paper proposes compiler-assisted speculative sampling with a cost model that jointly decides when to speculate and how to map the work onto heterogeneous edge processing units (PUs). Built on IREE/MLIR, the approach enables hardware-aware draft/target partitioning and end-to-end optimization, and is validated on silicon. Key contributions include a cost-model-guided decision framework, compiler-level abstractions for speculative sampling, and measured speedups of up to $1.68\times$ on real edge hardware, deviating by about $4\%$ from analytic predictions. The work makes low-latency edge deployment of autoregressive LLMs practical by unifying algorithmic acceleration with hardware-aware compiler mappings.
Abstract
LLM deployment on resource-constrained edge devices faces severe latency constraints, particularly in real-time applications where delayed responses can compromise safety or usability. Among the many approaches that mitigate the inefficiency of sequential token-by-token generation, Speculative Decoding (SD) has emerged as a promising technique. However, SD at the edge is hindered by two major challenges: (1) integrating SD into a compiler-based workflow without sacrificing performance or programmability, and (2) exploiting the heterogeneous compute resources of modern SoCs through carefully designed partitioning strategies. This work addresses these challenges with an analytical cost model that explores heterogeneous hardware configurations and guides coarse-grained partitioning of LLM subgraphs, particularly for the short input sequences typical of edge workloads. The cost model predicts when speculative sampling and heterogeneous execution are jointly beneficial. It is validated on an edge device featuring a hexacore Cortex-A CPU and a Mali GPU, showing up to $1.68\times$ speedup for translation tasks, closely matching analytic expectations.
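To make the idea of a cost model that decides "when to speculate and where to run" concrete, the following is a minimal sketch and not the paper's actual model. It uses the standard estimate from the speculative decoding literature for the expected number of tokens per round, $(1-\alpha^{\gamma+1})/(1-\alpha)$, where $\alpha$ is the per-token acceptance rate and $\gamma$ the number of drafted tokens. The function names, the latency table, and the simplifying assumptions (verifying $\gamma+1$ tokens costs one target forward pass; draft and target run sequentially; data transfer between PUs is free) are all hypothetical and chosen only for illustration.

```python
from itertools import product


def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per speculation round.

    Assumes each drafted token is accepted independently with
    probability alpha (a common simplification, not from the paper).
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def best_config(latency, alpha, gammas=range(1, 9)):
    """Pick the draft length and PU mapping with the highest predicted speedup.

    latency: dict mapping (model, pu) -> seconds per forward pass, with
             model in {"draft", "target"}. These are hypothetical
             measurements supplied by the caller.
    Returns gamma == 0 when plain autoregressive decoding is predicted
    to be at least as fast as any speculative configuration.
    """
    pus = sorted({pu for _, pu in latency})
    # Baseline: non-speculative decoding on the fastest PU for the target.
    base_pu = min(pus, key=lambda pu: latency[("target", pu)])
    base = latency[("target", base_pu)]
    best = {"gamma": 0, "draft_pu": None, "target_pu": base_pu, "speedup": 1.0}

    for gamma, d_pu, t_pu in product(gammas, pus, pus):
        # Assumption: one target pass verifies all gamma drafted tokens.
        round_time = gamma * latency[("draft", d_pu)] + latency[("target", t_pu)]
        per_token = round_time / expected_tokens(alpha, gamma)
        speedup = base / per_token
        if speedup > best["speedup"]:
            best = {"gamma": gamma, "draft_pu": d_pu,
                    "target_pu": t_pu, "speedup": speedup}
    return best


if __name__ == "__main__":
    # Made-up per-pass latencies in seconds, for illustration only.
    lat = {("draft", "cpu"): 0.010, ("draft", "gpu"): 0.008,
           ("target", "cpu"): 0.100, ("target", "gpu"): 0.060}
    print(best_config(lat, alpha=0.8))
```

Under these assumptions the sketch recommends speculation only when the acceptance rate is high enough to amortize the drafting cost; with a low acceptance rate it falls back to plain decoding (`gamma == 0`). A realistic model for an SoC would also need to account for transfer and synchronization costs between PUs and for verification cost that grows with $\gamma$.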
