Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
Chenxiang Ma, Zhisheng Ye, Hanyu Zhao, Zehua Yang, Tianhao Fu, Jiaxun Han, Jie Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Yong Li, Diyu Zhou
TL;DR
This work tackles the memory pressure of large language model inference by introducing Select-N, a latency-SLO-aware offloading system that moves model state between GPU memory and host memory. It exploits the deterministic per-layer compute time to define an offloading interval, and selects that interval in two stages: an offline stage generates performance records that identify the optimal intervals for given SLOs, sequence lengths, and batch sizes, and an online stage coordinates offloading per PCIe bus to adapt to bandwidth contention. Compared with DeepSpeed and FlexGen, Select-N consistently meets latency SLOs, delivers up to 1.85× higher throughput, and achieves up to 2.37× more host memory savings by enabling larger input sequences and batches. The approach enables deployment of larger models (e.g., LLaMA-13B, OPT-13B) with real-time responsiveness, making memory offloading practical for interactive LLM services.
Abstract
Offloading large language model (LLM) state to host memory during inference promises to reduce operational costs by supporting larger models, longer inputs, and larger batch sizes. However, existing memory offloading mechanisms are not designed with latency service-level objectives (SLOs) in mind. As a result, they either cause frequent SLO violations or underutilize host memory, incurring economic loss and defeating the purpose of memory offloading. This paper presents Select-N, a latency-SLO-aware memory offloading system for LLM serving. A key challenge in designing Select-N is reconciling the tension between meeting SLOs and maximizing host memory usage. Select-N overcomes it by exploiting a characteristic of modern LLMs: during serving, the computation time of each decoder layer is deterministic. Leveraging this, Select-N introduces the offloading interval, an internal tunable knob that captures the tradeoff between SLOs and host memory usage, thereby reducing the challenge to picking an optimal offloading interval. Select-N then uses a two-stage approach to pick the offloading interval automatically. The first stage runs offline and generates the range of optimal offloading intervals, while the second stage adjusts the offloading interval at the granularity of an inference iteration based on runtime hardware status. Our evaluation shows that Select-N consistently meets SLOs and improves serving throughput over existing mechanisms by up to 1.85× by maximizing the use of host memory.
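To make the tradeoff behind the offloading interval concrete, here is a minimal sketch in Python. It is not the Select-N implementation. It assumes a toy model in which one of every `interval` decoder layers is kept in host memory and its transfer overlaps with the computation of the other `interval - 1` layers. The function names, the overlap model, and all numbers are hypothetical and chosen only for illustration; times are in integer microseconds.

```python
def iteration_latency(num_layers, interval, compute_us, transfer_us):
    """Estimate the latency of one inference iteration (toy model).

    Assumption: one of every `interval` layers is offloaded to host memory,
    and fetching it overlaps with computing the other `interval - 1` layers.
    Any transfer time that is not hidden by compute shows up as a stall.
    """
    offloaded_layers = num_layers // interval
    stall_us = max(0, transfer_us - (interval - 1) * compute_us)
    return num_layers * compute_us + offloaded_layers * stall_us


def pick_interval(num_layers, compute_us, transfer_us, slo_us):
    """Return the smallest interval whose estimated latency meets the SLO.

    A smaller interval offloads more layers, so the first interval that fits
    the latency budget is the one that uses the most host memory.
    Returns None if no interval in 1..num_layers meets the SLO.
    """
    for interval in range(1, num_layers + 1):
        latency_us = iteration_latency(num_layers, interval, compute_us, transfer_us)
        if latency_us <= slo_us:
            return interval
    return None


# Hypothetical numbers: 40 layers, 1 ms compute and 4 ms transfer per layer.
tight = pick_interval(40, 1000, 4000, 45000)   # 45 ms budget
loose = pick_interval(40, 1000, 4000, 70000)   # 70 ms budget
```

With these made-up numbers, the 45 ms budget yields an interval of 5 and the looser 70 ms budget yields 3: relaxing the SLO lets more layers move to host memory. Because the per-layer compute time is deterministic, an estimate of this kind can be computed ahead of time, which is what makes an offline stage plausible; the online stage described above would then adjust the chosen interval when the effective transfer time changes at runtime.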
