Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing
Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang
TL;DR
This work addresses the problem of running autoregressive LLMs on memory- and latency-constrained edge devices by introducing an autoregressive-aware split computing framework. The framework combines three components: One-Point Split Compression (OPSC), which keeps the edge-side model within its memory budget; a two-stage intermediate-output compression pipeline (Threshold Splitting followed by Token-Wise Adaptive Bit Quantization), which reduces the data transferred between edge and server; and a unified optimization that jointly selects split points, quantization settings, and sequence lengths under memory and latency constraints. Evaluated across multiple LLMs and hardware platforms, the approach reduces server-side latency and communication overhead while preserving or improving accuracy, and it outperforms state-of-the-art quantization methods such as SmoothQuant, OmniQuant, and Atom. The results indicate that the framework is practical for real-world edge–cloud LLM deployments in resource-constrained environments.
Abstract
Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. While split computing offers a promising solution by partitioning model execution between edge devices and cloud servers, existing approaches fail to address the unique challenges of autoregressive inference, particularly the iterative token generation process and the growing key-value (KV) cache. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices. Our approach makes three key contributions. First, we develop one-point split compression (OPSC), a mixed-precision quantization scheme that prevents out-of-memory failures by partitioning the model into front-end and back-end segments with different precision levels. Second, we propose a two-stage intermediate compression pipeline that combines threshold splitting (TS) and token-wise adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations while substantially reducing communication overhead. Third, we formulate a unified optimization framework that jointly selects split points, quantization settings, and sequence lengths to satisfy strict memory and latency constraints. Extensive evaluations across diverse LLMs and hardware platforms demonstrate superior performance compared to state-of-the-art quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework achieves a 1.49× inference speedup and a significant reduction in communication overhead while maintaining or improving model accuracy.
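
To make the second contribution more concrete, the following is a minimal sketch of how a two-stage intermediate compression step could look. It is not the authors' implementation: the abstract does not specify how the threshold is chosen or how bit-widths are assigned, so the magnitude threshold `tau`, the error tolerance `max_err`, the candidate bit-widths, and all function names are illustrative assumptions. Stage one sets aside large-magnitude activations and keeps them exact; stage two quantizes the remainder of each token with the smallest candidate bit-width that meets the tolerance.

```python
def threshold_split(token, tau):
    """Stage 1 (assumed form): keep entries with |x| > tau exact, zero them in the dense part."""
    outliers = {i: v for i, v in enumerate(token) if abs(v) > tau}
    dense = [0.0 if i in outliers else v for i, v in enumerate(token)]
    return dense, outliers


def quantize(values, bits):
    """Symmetric uniform quantization to signed integer codes of the given width."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def compress_token(token, tau, max_err, candidate_bits=(2, 4, 8)):
    """Stage 2 (assumed rule): pick the smallest bit-width whose max error is within max_err.

    If no candidate meets the tolerance, the widest candidate is used.
    """
    dense, outliers = threshold_split(token, tau)
    for bits in candidate_bits:
        codes, scale = quantize(dense, bits)
        restored = dequantize(codes, scale)
        err = max((abs(a - b) for a, b in zip(dense, restored)), default=0.0)
        if err <= max_err:
            break
    return {"bits": bits, "scale": scale, "codes": codes, "outliers": outliers}


def decompress_token(packet):
    """Server side: dequantize the dense part, then restore the exact outliers."""
    values = dequantize(packet["codes"], packet["scale"])
    for i, v in packet["outliers"].items():
        values[i] = v
    return values


# One token's activation vector with a single large-magnitude entry.
token = [0.1, -0.2, 9.0, 0.05]
packet = compress_token(token, tau=1.0, max_err=0.02)
restored = decompress_token(packet)
```

In this sketch the outlier at index 2 is transmitted exactly, so it does not inflate the quantization scale of the remaining entries, and each token can end up with a different bit-width depending on how hard its dense part is to represent.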
