NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu

TL;DR

NEO is an online LLM inference system that offloads part of the attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput.

Abstract

Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines rely heavily on request batching to improve inference throughput and make serving cost-efficient on expensive GPU accelerators. However, limited GPU memory constrains the batch size achievable in practice, leaving significant GPU compute resources idle. We present NEO, an online LLM inference system that offloads part of the attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLMs (i.e., 7B, 8B, 70B). NEO achieves up to 7.5$\times$, 26%, and 14% higher throughput than a GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to a 79.3% throughput gain on an A10G GPU.
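To make the memory argument concrete, the following back-of-the-envelope sketch estimates how many requests fit in a KV cache of a given size. It is an illustration only, not NEO's scheduler: the function names, the model shape (32 layers, 32 KV heads, head dimension 128, fp16), the 13 GiB weight footprint, the 1000-token context, and the 64 GiB of host memory are all assumed values chosen for the example, and activation memory is ignored.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_batch_size(mem_bytes, weight_bytes, tokens_per_request, per_token_bytes):
    """Number of requests whose KV cache fits in the memory left after weights."""
    free_bytes = max(mem_bytes - weight_bytes, 0)
    return free_bytes // (tokens_per_request * per_token_bytes)


# Hypothetical 7B-class model shape (assumed, not taken from the paper).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Assumed 16 GiB GPU holding about 13 GiB of fp16 weights.
gpu_only = max_batch_size(16 * GIB, 13 * GIB, 1000, per_token)

# Assumed 64 GiB of host memory used only for offloaded KV cache.
cpu_extra = max_batch_size(64 * GIB, 0, 1000, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Requests that fit on the GPU alone: {gpu_only}")
print(f"Additional requests host memory could hold: {cpu_extra}")
```

The estimate covers memory capacity only. How many offloaded requests can actually be served also depends on CPU attention throughput and GPU-CPU transfer cost, which is the balance the abstract attributes to asymmetric pipelining and load-aware scheduling.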

Paper Structure

This paper contains 18 sections, 1 equation, 17 figures, 1 table.

Figures (17)

  • Figure 1: Workflow of transformer-based LLM Inference.
  • Figure 2: Overall architecture of NEO. "runQ" means "runqueue".
  • Figure 3: Simple offloading strawman offloads all requests' KV cache and decoding attention computation to the CPU. "Comm" stands for GPU-CPU communication; "TrQKV" means transferring Q,K,V tensors to CPU; "TrO" means transferring attention output to GPU.
  • Figure 4: Symmetric pipelining strawman forms two identical sub-batches and overlaps linear and attention operations for the decoding stage. The red and blue arrows depict the data flows of the two sub-batches. "pr" means pre-projection and "po" means post-projection + FFN operations; together they form the linear stage. "attn" means attention operations. "TrQKV"s and "TrO"s are omitted for simplicity.
  • Figure 5: Asymmetric pipelining integrates the prefilling stage into one sub-batch (red arrows) and most of the decoding attention operations into another (blue arrows). "pr" means pre-projection, while "po" means post-projection + FFN operations; "attn" means attention operations; "Comm" stands for GPU-CPU communication.
  • ...and 12 more figures