Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang

TL;DR

This work tackles the practical problem of running autoregressive LLMs on memory- and latency-constrained edge devices by introducing an autoregressive-aware split computing framework. It combines One-Point Split Compression (OPSC), which keeps the edge-side model within its memory budget; a two-stage intermediate-output compression pipeline (Threshold Splitting and Token-Wise Adaptive Bit Quantization) that sharply reduces data transfer; and a unified optimization that jointly selects split points, quantization settings, and sequence lengths under strict memory and latency constraints. The approach is validated across multiple LLMs and hardware platforms, where it reduces server-side latency and communication overhead while preserving or improving accuracy, and it outperforms state-of-the-art quantization methods such as SmoothQuant, OmniQuant, and Atom. The results support the framework's practicality for real-world edge–cloud LLM deployments in resource-constrained environments.

Abstract

Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. While split computing offers a promising solution by partitioning model execution between edge devices and cloud servers, existing approaches fail to address the unique challenges of autoregressive inference, particularly the iterative token generation process and expanding key-value (KV) cache requirements. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices. Our approach makes three key contributions. First, we develop one-point split compression (OPSC), a mixed-precision quantization scheme that prevents out-of-memory failures by strategically partitioning models into front-end and back-end segments with different precision levels. Second, we propose a two-stage intermediate compression pipeline that combines threshold splitting (TS) and token-wise adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations while dramatically reducing communication overhead. Third, we formulate a unified optimization framework that jointly selects optimal split points, quantization settings, and sequence lengths to satisfy strict memory and latency constraints. Extensive evaluations across diverse LLMs and hardware platforms demonstrate superior performance compared to state-of-the-art quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework achieves a 1.49× inference speedup and significant communication overhead reduction while maintaining or improving model accuracy.
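
The two-stage intermediate compression described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual TS and TAB-Q rules, so everything below is an assumption made for illustration: the threshold `tau`, the candidate bit widths, the error bound `max_err`, the choice to apply the threshold split before quantization, and all function names are hypothetical. The sketch keeps large-magnitude activations at full precision as sparse outliers and quantizes the remaining values of each token with the smallest bit width that meets the error bound.

```python
# Illustrative sketch only: not the paper's TS/TAB-Q algorithm.
# All names, thresholds, and the bit-selection rule are assumptions.

def threshold_split(row, tau):
    """Split one token's activations into a dense part (|x| <= tau) and sparse outliers."""
    dense = [x if abs(x) <= tau else 0.0 for x in row]
    outliers = {i: x for i, x in enumerate(row) if abs(x) > tau}
    return dense, outliers

def quantize_token(dense, bits):
    """Uniform symmetric quantization of one token's dense values to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in dense)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [round(x / scale) for x in dense]
    return codes, scale

def choose_bits(dense, candidates, max_err):
    """Pick the smallest candidate bit width whose worst-case error is <= max_err."""
    for bits in sorted(candidates):
        codes, scale = quantize_token(dense, bits)
        err = max(abs(c * scale - x) for c, x in zip(codes, dense))
        if err <= max_err:
            return bits, codes, scale
    # No candidate meets the bound: fall back to the widest one.
    bits = max(candidates)
    codes, scale = quantize_token(dense, bits)
    return bits, codes, scale

def compress(acts, tau, candidates=(2, 4, 8), max_err=0.05):
    """Compress a [tokens x hidden] activation matrix, one packet per token."""
    packets = []
    for row in acts:
        dense, outliers = threshold_split(row, tau)
        bits, codes, scale = choose_bits(dense, candidates, max_err)
        packets.append({"bits": bits, "scale": scale,
                        "codes": codes, "outliers": outliers})
    return packets

def decompress(packets):
    """Rebuild approximate activations on the receiving side."""
    restored = []
    for p in packets:
        row = [c * p["scale"] for c in p["codes"]]
        for i, x in p["outliers"].items():
            row[i] = x  # outliers were sent unquantized
        restored.append(row)
    return restored

acts = [[0.1, -0.2, 50.0, 0.3], [0.0, 0.0, 0.0, 0.0]]
packets = compress(acts, tau=1.0)
restored = decompress(packets)
```

In this toy input the value 50.0 exceeds `tau` and is transmitted exactly, while the remaining entries of each token are sent as low-bit integer codes plus one scale. A real system would also need to account for the cost of sending outlier indices and per-token metadata, which this sketch ignores.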

Paper Structure

This paper contains 24 sections, 15 equations, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: Schematic diagram of three scenarios for deploying LLM to edge devices. (a) local computing, (b) edge computing, and (c) split computing (SC)
  • Figure 2: (a) One-point split compression schematic. (b) Intermediate output of LLM
  • Figure 3: An example of the overall pipeline for applying the proposed intermediate output compression technique.
  • Figure 4: Effect of magnitude-based clamping of the intermediate output on the HellaSwag performance of the Llama-2 13B model. (a) Accuracy as a function of the upper limit applied to large intermediate output values. (b) Distribution of values in the intermediate output.
  • Figure 5: (a) Total server inference time (in minutes) versus the number of edge devices for three configurations: 'Cloud-only' (all tokens processed by the server) and our SC method with $\bar{W} = 250$ and $\bar{W} = 350$. (b) Number of tokens generated by the server as $\bar{W}$ varies. Our approach gradually offloads more inference steps to the edge device, significantly reducing both server inference time and token generation overhead.
  • ...and 2 more figures