AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains

Abhishek Vijaya Kumar, Gianni Antichi, Rachee Singh

TL;DR

GPU memory capacity limits constrain LLM inference under bursty loads, causing contention and degraded responsiveness. Aqua introduces a memory management stack that decouples memory from compute and enables preemptive, fair scheduling across a scale-up Nvlink domain, using Aqua-profiler, Aqua-placer, and Aqua-lib to offload and migrate inference state with low paging overhead. On eight Nvidia H100 80GB GPUs, Aqua delivers approximately 20x improvements in time-to-first-token and up to 4x gains in long-prompt throughput, while maintaining high utilization under memory pressure. By leveraging fast inter-GPU memory sharing and dynamic memory elasticity, Aqua enables responsive, scalable LLM serving in modern datacenters.

Abstract

Inference on large-language models (LLMs) is constrained by GPU memory capacity. A sudden increase in the number of inference requests to a cloud-hosted LLM can deplete GPU memory, leading to contention between multiple prompts for limited resources. Modern LLM serving engines deal with the challenge of limited GPU memory using admission control, which causes them to be unresponsive during request bursts. We propose that preemptive scheduling of prompts in time slices is essential for ensuring responsive LLM inference, especially under conditions of high load and limited GPU memory. However, preempting prompt inference incurs a high paging overhead, which reduces inference throughput. We present Aqua, a GPU memory management framework that significantly reduces the overhead of paging inference state, achieving both responsive and high-throughput inference even under bursty request patterns. We evaluate Aqua by hosting several state-of-the-art large generative ML models of different modalities on servers with 8 Nvidia H100 80G GPUs. Aqua improves the responsiveness of LLM inference by 20X compared to the state-of-the-art and improves LLM inference throughput over a single long prompt by 4X.

Paper Structure

This paper contains 23 sections, 6 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Responsiveness (measured using time-to-first-token, or TTFT) and throughput (measured using request completion time, or RCT) of inference queries on LLMs. Because vLLM processes queries in batches, it has low RCT (high throughput) but high TTFT (low responsiveness). Fair scheduling of queries improves responsiveness, but paging overheads dominate RCT. Aqua reduces paging overheads by offloading memory over high-speed multi-GPU interconnects (e.g., Nvlinks), achieving responsive inference with low RCT.
  • Figure 2: Design of Aqua.
  • Figure 3: The plots show that when audio and image generation models reach their throughput plateau, tens of GBs of memory remain free on an A100 80GB GPU. In fact, increasing the batch size beyond a point yields diminishing gains in throughput.
  • Figure 4: One panel shows the time per output token (TPOT) for the Llama 3.1 8B model on an 80 GB A100 GPU. TPOT increases with request rate. The other panel illustrates the time to first token (TTFT), which rises with the queue length from incoming requests.
  • Figure 5: Design of Aqua. GPU 1 is hosting a consumer. Aqua-lib is aware that GPU 0 is a producer. Aqua-lib allows the model on GPU 1 to allocate Aqua Tensors that are offloaded to GPU 0's HBM (shown in a pink box with number 1). If GPU 0 only has enough memory to offload one tensor, Aqua-lib falls back to the host DRAM (shown using a pink box with number 2).
  • ...and 8 more figures
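
The fallback described in the Figure 5 caption (offload to the producer GPU's HBM first, then to host DRAM once the producer is full) can be illustrated with a minimal sketch. This is not Aqua-lib's API: `MemoryPool`, `try_reserve`, and `place_offloaded_tensor` are hypothetical names introduced here, and the sketch only tracks byte counts rather than moving any real tensor data.

```python
from dataclasses import dataclass


@dataclass
class MemoryPool:
    """Hypothetical stand-in for a memory region (producer HBM or host DRAM)."""
    name: str
    free_bytes: int

    def try_reserve(self, nbytes: int) -> bool:
        # Reserve space only if the pool can hold the whole tensor.
        if nbytes <= self.free_bytes:
            self.free_bytes -= nbytes
            return True
        return False


def place_offloaded_tensor(nbytes: int, producer_hbm: MemoryPool,
                           host_dram: MemoryPool) -> str:
    """Return the name of the pool chosen for an offloaded tensor.

    Assumed policy, following the Figure 5 caption: prefer the producer
    GPU's HBM, and fall back to host DRAM when the producer has no room.
    """
    if producer_hbm.try_reserve(nbytes):
        return producer_hbm.name
    if host_dram.try_reserve(nbytes):
        return host_dram.name
    raise MemoryError("no pool can hold the offloaded tensor")


if __name__ == "__main__":
    gib = 1 << 30
    gpu0 = MemoryPool("gpu0_hbm", free_bytes=1 * gib)   # room for one tensor
    host = MemoryPool("host_dram", free_bytes=64 * gib)
    print(place_offloaded_tensor(1 * gib, gpu0, host))  # first tensor
    print(place_offloaded_tensor(1 * gib, gpu0, host))  # second tensor
```

In this toy setup the producer has room for exactly one tensor, so the first request lands in the producer pool and the second falls back to the host pool, mirroring the two pink boxes in the caption.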