MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models

Junyang Zhang, Tianyi Zhu, Cheng Luo, Anima Anandkumar

TL;DR

MOM addresses the memory bottleneck of long-context LLM inference by combining mini-sequence processing for MLPs with dynamic KV cache offloading. It achieves over 50% peak memory reduction and extends the maximum context length from 155k to 455k tokens on a single A100-80GB GPU while preserving output equivalence and maintaining competitive throughput. By eliminating prefill memory as the dominant constraint, MOM shifts focus to decode-stage residual KV-cache efficiency, enabling practical deployment of long-context models on affordable hardware. This approach broadens accessibility of large language models and suggests a new research emphasis on decode-time KV-cache optimization.

Abstract

Long-context language models exhibit impressive performance but remain challenging to deploy due to high GPU memory demands during inference. We propose Memory-efficient Offloaded Mini-sequence Inference (MOM), a method that partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading. Experiments on various Llama, Qwen, and Mistral models demonstrate that MOM reduces peak memory usage by over 50% on average. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU, while keeping outputs identical and not compromising accuracy. MOM also maintains highly competitive throughput due to minimal computational overhead and efficient last-layer processing. Compared to traditional chunked prefill methods, MOM achieves a 35% greater context length extension. More importantly, our method drastically reduces prefill memory consumption, eliminating it as the longstanding dominant memory bottleneck during inference. This breakthrough fundamentally changes research priorities, redirecting future efforts from prefill-stage optimizations to improving decode-stage residual KV cache efficiency.
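The mini-sequence idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names (`mlp_rows`, `mini_sequence_mlp`), the ReLU activation, and the tiny dimensions are assumptions chosen for illustration, and it runs on plain Python lists rather than on a GPU. It shows only the property that an MLP acts on each token position independently, so processing the sequence in slices gives the same output while the largest intermediate activation shrinks with the number of slices.

```python
import random


def mlp_rows(rows, w_up, w_down):
    """Position-wise MLP, ReLU(x @ W_up) @ W_down, applied to each row independently.

    w_up is a list of d_ff columns (each of length d_model);
    w_down is a list of d_model columns (each of length d_ff).
    """
    out = []
    for x in rows:
        hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in w_up]
        out.append([sum(h * w for h, w in zip(hidden, col)) for col in w_down])
    return out


def mini_sequence_mlp(rows, w_up, w_down, num_chunks):
    """Run the same MLP over `num_chunks` slices of the sequence.

    Returns the concatenated output and the largest number of hidden
    activations that would be alive at once (rows in a slice * d_ff).
    """
    chunk = max(1, -(-len(rows) // num_chunks))  # ceiling division
    out, peak_hidden = [], 0
    for start in range(0, len(rows), chunk):
        piece = rows[start:start + chunk]
        peak_hidden = max(peak_hidden, len(piece) * len(w_up))
        out.extend(mlp_rows(piece, w_up, w_down))
    return out, peak_hidden


# Hypothetical toy sizes, far smaller than any real model.
rng = random.Random(0)
seq_len, d_model, d_ff = 8, 4, 16
x = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]
w_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]

full = mlp_rows(x, w_up, w_down)            # whole sequence at once
full_hidden = seq_len * d_ff                # hidden activations held at once
chunked, peak = mini_sequence_mlp(x, w_up, w_down, num_chunks=4)

print(chunked == full, full_hidden, peak)
```

In this toy setting, four slices cut the peak hidden-activation count to a quarter of the unchunked value while leaving the output unchanged. Real MLP blocks in the evaluated models use gated activations and run as batched GPU kernels, so actual savings depend on the model and on the other memory consumers, such as the KV cache that MOM handles through offloading.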

Paper Structure

This paper contains 27 sections, 5 equations, 13 figures, 4 tables, 2 algorithms.

Figures (13)

  • Figure 1: GPU Memory Comparison of Llama 3 Standard vs. Llama 3 with MOM for a 64K Input Context.
  • Figure 2: Memory vs. Throughput (Average of Various Input Sequence Lengths).
  • Figure 3: MOM Architecture Overview.
  • Figure 4: Dynamic KV Cache Transfer Between GPU and CPU in Prefill and Decode Stages.
  • Figure 5: VRAM Comparison for Mini-sequence Inference and Offloads.
  • ...and 8 more figures