Elastic On-Device LLM Service
Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu
TL;DR
This paper tackles the challenge of serving on-device large language model requests that carry diverse latency requirements. It introduces ElastiLM, which elastifies both the model and the prompt to meet per-request SLOs, using a one-shot reordering of permutation-consistent Transformer units and a dual-head Tiny Language Model (TLM) that orchestrates prompt-model elasticity. Offline preparation yields high-quality sub-models that can be switched online with minimal overhead, and the lightweight TLM guides prompt compression and model selection to satisfy latency constraints without sacrificing accuracy. Empirical results on multiple devices and LLMs show absolute accuracy gains of up to 14.83 percentage points (10.45 on average), with low switching overhead (<1% of TTFT) and fewer than 100 offline GPU hours, demonstrating practical deployment potential for elastic on-device LLM services.
Abstract
On-device Large Language Models (LLMs) are transforming mobile AI, catalyzing applications like UI automation without privacy concerns. The common practice today is to deploy a single yet powerful LLM as a general task solver for multiple requests. We identify a key system challenge in this paradigm: current LLMs lack the elasticity to serve requests that have diverse Service-Level Objectives (SLOs) on inference latency. To tackle this, we present ElastiLM, an on-device LLM service that elasticizes both the model and the prompt dimensions of a full LLM. It incorporates (1) a one-shot neuron-reordering method, which leverages the intrinsic permutation consistency in transformer models to generate high-quality elasticized sub-models with minimal runtime switching overhead; and (2) a dual-head tiny language model, which efficiently and effectively refines the prompt and orchestrates the elastification between model and prompt. We implement this elastic on-device LLM service on multiple COTS smartphones, and evaluate ElastiLM on both standalone NLP/mobile-agent datasets and end-to-end synthesized traces. Across diverse SLOs, ElastiLM outperforms 7 strong baselines in (absolute) accuracy by up to 14.83% and by 10.45% on average, with <1% TTFT switching overhead, on-par memory consumption, and <100 offline GPU hours.
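The permutation consistency mentioned in the abstract can be illustrated on a toy feed-forward block. The sketch below is not the paper's implementation: the function names (`mlp`, `reorder`, `prefix_submodel`), the layer sizes, and especially the importance score are hypothetical placeholders. It only shows that applying the same permutation to the hidden neurons of the up-projection and the down-projection leaves the output unchanged, and that once neurons are sorted, a smaller sub-model can be taken as a contiguous prefix. Reading that prefix property as the source of the low switching overhead is an assumption here.

```python
import random

def mlp(x, w_up, w_down):
    """Toy feed-forward block: y = W_down @ relu(W_up @ x).
    w_up has one row per hidden neuron; w_down has one column per hidden neuron."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]

def reorder(w_up, w_down, order):
    """Apply one permutation to the rows of w_up and the columns of w_down."""
    new_up = [w_up[i] for i in order]
    new_down = [[row[i] for i in order] for row in w_down]
    return new_up, new_down

def prefix_submodel(w_up, w_down, k):
    """Keep only the first k hidden neurons (a contiguous slice)."""
    return w_up[:k], [row[:k] for row in w_down]

rng = random.Random(0)
d_in, d_hidden, d_out = 4, 8, 3
w_up = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_hidden)]
w_down = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

# Placeholder importance score (assumption, not the paper's criterion):
# the L1 norm of each hidden neuron's outgoing weights.
importance = [sum(abs(row[i]) for row in w_down) for i in range(d_hidden)]
order = sorted(range(d_hidden), key=lambda i: -importance[i])

# Reordering is done once; the full model's output is unchanged
# up to floating-point summation order.
r_up, r_down = reorder(w_up, w_down, order)
y_full = mlp(x, w_up, w_down)
y_reordered = mlp(x, r_up, r_down)
max_diff = max(abs(a - b) for a, b in zip(y_full, y_reordered))

# A half-width sub-model is simply the first half of the reordered neurons.
s_up, s_down = prefix_submodel(r_up, r_down, d_hidden // 2)
y_sub = mlp(x, s_up, s_down)
```

In this toy setting, choosing a different sub-model size only changes the slice bound `k`, so no weights need to be moved or copied in a different layout when the size changes.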
