AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

Boxun Xu, Yu Wang, Zihu Wang, Peng Li

TL;DR

This work addresses the memory inefficiency of KV caches in visual autoregressive (VAR) transformers, which generate images via next-scale prediction. By systematically analyzing cross-scale attention, it identifies condensed and local scales as highly influential and shows that decoder layers differ in how much cache they need. It then introduces AMS-KV, an adaptive per-layer KV caching policy that combines condensation, local-scale rolling, and a similarity-guided expansion mechanism with a Condensed Least Recently Used eviction policy; this yields substantial memory reductions while maintaining generation quality. Across VAR variants and Infinity-2B, AMS-KV reduces KV cache usage by up to 84.83% and delivers significant latency savings, enabling larger batch sizes and mitigating out-of-memory (OOM) failures, which makes VAR-based image generation more practical and scalable.

Abstract

Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale-based VAR transformers remains largely unexplored. A major bottleneck is the excessive growth of KV memory as the number of scales increases, which severely limits scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales contributes significantly to generation quality; (2) allocating a small amount of memory to the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is observed predominantly in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
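To make the memory argument concrete, the following is a minimal sketch of how the number of cached tokens grows with the scale index, and of how keeping only condensed and local scales shrinks it. It is not the paper's implementation: the scale side lengths in `SCALE_SIDES`, the function names, and the defaults `n_condensed=2` and `n_local=2` are assumptions chosen for illustration, and the sketch ignores the per-layer adaptation and eviction policy, so its numbers are not expected to match the reported 84.83%.

```python
# Illustrative sketch only; all names and values below are assumptions.

# Assumed side lengths of the token maps r_1..r_10 (one entry per scale).
SCALE_SIDES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def tokens_per_scale(sides=SCALE_SIDES):
    """Number of tokens in each scale: a side x side token map."""
    return [s * s for s in sides]


def full_cache_tokens(step, sides=SCALE_SIDES):
    """Tokens of all previous scales cached when generating scale `step` (0-based)."""
    return sum(tokens_per_scale(sides)[:step])


def kept_scales(step, n_condensed=2, n_local=2):
    """Indices of previous scales retained by a condensed + local policy.

    Condensed scales are the first `n_condensed` (coarsest) scales;
    local scales are the `n_local` scales immediately before `step`.
    """
    condensed = set(range(min(n_condensed, step)))
    local = set(range(max(0, step - n_local), step))
    return sorted(condensed | local)


def reduced_cache_tokens(step, n_condensed=2, n_local=2, sides=SCALE_SIDES):
    """Cached tokens when only condensed and local scales are retained."""
    counts = tokens_per_scale(sides)
    return sum(counts[i] for i in kept_scales(step, n_condensed, n_local))


if __name__ == "__main__":
    for step in range(len(SCALE_SIDES)):
        print(step, full_cache_tokens(step), reduced_cache_tokens(step))
```

Under these assumed settings the two counts coincide for the first few scales and diverge only once intermediate scales exist to be dropped, which is consistent with the abstract's point that memory pressure comes from the growing number of scales.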

Paper Structure

This paper contains 33 sections, 4 equations, 9 figures, 9 tables, 2 algorithms.

Figures (9)

  • Figure 1: Top: $256\times 256$ images generated by VAR-d30 and $1024\times 1024$ images generated by Infinity-2B. Bottom: images generated using tuning-free AMS-KV with 4.7$\times$ lower KV cache memory consumption.
  • Figure 2: Overview of Adaptive Multi-Scale KV Caching (AMS-KV) for Visual Autoregressive Modeling.
  • Figure 3: (a) Growth of the attention map. (b) Memory usage of VAR across scales during unconditional image generation on an NVIDIA A100-80G (batch size = 50).
  • Figure 4: Visualization of the attention density across heads and scales when generating the last scale $r_{10}$.
  • Figure 5: An eagle generated by VAR-d30 with (a) full KV cache; (b) intermediate scales removed; (c) local scales removed; (d) condensed scales removed. Red circles highlight missing fine details; blue circles indicate distorted regions.
  • ...and 4 more figures