KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference

Sai Gokhale, Devleena Das, Rajeev Patwari, Ashish Sirasao, Elliott Delaye

TL;DR

KV Pareto presents a systems-level framework to jointly optimize KV cache quantization, prefill chunking, and 4-bit weight quantization for long-context inference. By evaluating multiple LLMs across KV quantization granularities and weight quantization schemes, it maps Pareto frontiers between memory savings and accuracy, demonstrating 68-78% total memory reductions with minimal 1-3% accuracy loss. The study underscores model-dependent frontiers and validates results on long-context benchmarks (LongBench, NIAH) and shorter tasks (GSM8k, MMLU), including context lengths up to 128k. Practical implications highlight scalable, edge-friendly configurations that dramatically reduce memory while preserving performance for long-context scenarios.

Abstract

Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference due to the linear growth of the key-value (KV) cache with sequence length. While individual optimization techniques like KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ. It identifies model-specific Pareto-optimal configurations that achieve 68-78% total memory reduction with minimal (1-3%) accuracy degradation on long-context tasks. We further validate the selected frontiers on the Needle-in-a-Haystack, GSM8k, and MMLU benchmarks, as well as at extended context lengths of up to 128k, to demonstrate the practical need for joint optimization in efficient LLM inference.
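
To make the linear growth of the KV cache concrete, the following is a minimal back-of-the-envelope sketch rather than the paper's own memory accounting. The function name and the model shape (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration, and the estimate ignores the scale and zero-point metadata that a real quantization scheme stores alongside the quantized values.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bits_per_value=16, batch_size=1):
    """Approximate KV cache size in bytes (hypothetical helper).

    Ignores quantization metadata (scales, zero-points), so low-bit
    results are a lower bound on real memory use.
    """
    # The factor of 2 accounts for storing both keys and values.
    num_values = 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim
    return num_values * bits_per_value // 8


if __name__ == "__main__":
    # Assumed model shape, not taken from the paper.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    for seq_len in (10_000, 128 * 1024):
        for bits in (16, 8, 4, 2):
            gib = kv_cache_bytes(seq_len, bits_per_value=bits, **shape) / 2**30
            print(f"seq_len={seq_len:>7} kv_bits={bits:>2} -> {gib:.2f} GiB")
```

Under these assumptions the cache size doubles whenever the sequence length doubles, and moving from 16-bit to 4-bit KV values cuts it to a quarter, which is why KV quantization matters most at long context lengths.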

Paper Structure

This paper contains 37 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Our KV Pareto Framework, showcasing the integration of prefill chunking (PC), KV cache quantization and model quantization for prefill and decode phases.
  • Figure 2: Pareto curves for five models that show the tradeoff between task accuracy and memory consumption, with frontiers shown with a star, and horizontal lines showing baseline (w16a16_k16v16) accuracy.
  • Figure 3: NIAH performance on baseline (a) and Pareto-optimal configurations (b).
  • Figure 4: Peak memory consumption on 10k vs. 128k context lengths, comparing SDPA and Flash MHA.
  • Figure 5: TPOT and TTFT curves on the HotpotQA dataset, showcasing the bottleneck of a growing KV cache at longer contexts.
  • ...and 3 more figures
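
The Pareto curves described for Figure 2 rest on a simple idea: a configuration is on the frontier if no other configuration uses no more memory while reaching at least the same accuracy. The sketch below shows one common way to extract such a frontier; it is not the authors' code. The function name is hypothetical, the configuration labels merely imitate the w16a16_k16v16 naming style, and every memory and accuracy number is made up for illustration.

```python
def pareto_frontier(configs):
    """Return the non-dominated (name, memory, accuracy) tuples.

    Lower memory and higher accuracy are better. The result is
    sorted by increasing memory.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by memory ascending; on ties, put the most accurate first.
    for name, memory, accuracy in sorted(configs, key=lambda c: (c[1], -c[2])):
        # Keep a point only if it beats every cheaper point on accuracy.
        if accuracy > best_accuracy:
            frontier.append((name, memory, accuracy))
            best_accuracy = accuracy
    return frontier


if __name__ == "__main__":
    # Illustrative values only: (label, memory in GB, accuracy).
    configs = [
        ("w16a16_k16v16", 20.0, 0.50),
        ("w16a16_k4v4", 14.0, 0.49),
        ("w4a16_k16v16", 12.0, 0.48),
        ("w4a16_k8v8", 8.0, 0.49),
        ("w4a16_k4v4", 6.0, 0.48),
        ("w4a16_k2v2", 5.0, 0.30),
    ]
    for name, memory, accuracy in pareto_frontier(configs):
        print(f"{name}: {memory} GB, accuracy {accuracy}")
```

In this toy data the two configurations that spend more memory than a cheaper, equally accurate alternative drop out, leaving four frontier points from which a deployment can pick according to its memory budget.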