Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices

Alejandro Ruiz y Mesa, Guilherme Korol, Moritz Riesterer, João Paulo Cardoso de Lima, Jeronimo Castrillon

TL;DR

LLMs suffer high latency on resource-constrained edge devices. The paper proposes compiler-assisted speculative sampling with a cost model that jointly decides when to speculate and how to map the work onto heterogeneous edge processing units (PUs). Built on IREE/MLIR, the approach enables hardware-aware draft/target partitioning and end-to-end optimization, and it is validated on real hardware. Key contributions include a cost-model-guided decision framework, compiler-level abstractions for speculative sampling, and measured speedups of up to $1.68\times$ on an edge device, deviating by about $4\%$ from the analytic predictions. By combining algorithmic acceleration with hardware-aware compiler mappings, the work makes low-latency edge deployment of autoregressive LLMs more practical.

Abstract

LLM deployment on resource-constrained edge devices faces severe latency constraints, particularly in real-time applications where delayed responses can compromise safety or usability. Among many approaches to mitigate the inefficiencies of sequential token-by-token generation, Speculative Decoding (SD) has emerged as a promising technique. However, SD at the edge is hindered by two major challenges: (1) integrating SD into a compiler-based workflow without sacrificing performance or programmability, and (2) exploiting the heterogeneous compute resources of modern SoCs through carefully designed partitioning strategies. This work addresses these challenges by using an analytical cost model that explores heterogeneous hardware configurations and guides coarse-grained partitioning of LLM subgraphs, particularly with edge-typical short input sequence lengths. The cost model predicts when speculative sampling and heterogeneous execution are jointly beneficial and is validated on an edge device featuring a hexacore Cortex-A CPU and a Mali GPU, revealing up to 1.68$\times$ speedup for translation tasks, closely matching analytic expectations.
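
The paper's own cost model is not reproduced in this summary. As a rough illustration of how such a model can decide whether speculation pays off, the sketch below uses the standard expected-speedup expression from the speculative decoding literature, which is an assumption here and not necessarily the authors' formulation. The names `alpha` (per-token acceptance rate), `gamma` (number of drafted tokens per verification step), and `cost_ratio` (drafter step time divided by target step time) are illustrative; in a heterogeneous setting, `cost_ratio` would presumably be measured for each candidate drafter/target mapping.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target verification step.

    Assumes each drafted token is accepted independently with
    probability alpha (a simplifying assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected speedup over plain autoregressive decoding.

    One step costs gamma drafter runs plus one target run, expressed
    in units of a single target run.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 8):
    """Pick the draft length with the highest expected speedup.

    gamma = 0 corresponds to no speculation (speedup 1.0), so a result
    of 0 means speculation is not predicted to help.
    """
    best = max(range(max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, cost_ratio))
    return best, expected_speedup(alpha, best, cost_ratio)


if __name__ == "__main__":
    # Hypothetical values, not measurements from the paper.
    for alpha, cost_ratio in [(0.8, 0.1), (0.5, 0.3), (0.2, 0.9)]:
        gamma, speedup = best_gamma(alpha, cost_ratio)
        print(f"alpha={alpha}, c={cost_ratio}: gamma={gamma}, "
              f"speedup={speedup:.2f}")
```

Under this simplified model, a low acceptance rate combined with an expensive drafter yields `gamma = 0`, i.e. the model recommends plain decoding. This is the kind of decision the abstract refers to as predicting when speculative sampling and heterogeneous execution are jointly beneficial.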

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Three generative pipelines: standard sampling (top, in blue), and two variants of SD: sequential drafting (center) and tree-based drafting (bottom). Adapted from Miao et al. (SpecInfer, 2024).
  • Figure 2: Overview of heterogeneous mapping workflow for speculative sampling on edge devices.
  • Figure 3: Monolithic approach: single IREE module with target, drafter, and control flow subgraphs with heterogeneous device affinities.
  • Figure 4: Modular approach: separate IREE modules for model arithmetic with control flow in the serving platform.
  • Figure 5: Acceptance rate $\alpha$ distribution for different quantization schemes: FP (FP32), T (target), D (drafter).
  • ...and 2 more figures