Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

Ziran Qin, Youru Lv, Mingbao Lin, Hang Guo, Zeren Zhang, Danping Zou, Weiyao Lin

TL;DR

The paper tackles the memory and computation bottlenecks in Visual AutoRegressive (VAR) models caused by KV caches accumulating across scales. It introduces HACK, a training-free, head-aware KV cache compression framework that distinguishes contextual from structural attention heads, assigns them asymmetric cache budgets, and applies pattern-specific compression to each type. This keeps the average KV cache length within a fixed budget $B$ and reduces the theoretical attention complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(Bn^2)$ without degrading generation quality. Using offline head classification and targeted KV pruning, HACK achieves up to 70% KV cache compression across multiple VAR models on text-to-image and class-conditional generation, for example a $1.75\times$ memory reduction and a $1.57\times$ speedup on Infinity-8B. The approach generalizes across models and is compatible with existing acceleration techniques, offering a practical path to scalable, high-quality VAR inference.

Abstract

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality content generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale generation paradigm. We begin with a crucial observation: attention heads in VAR models can be divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads are responsible for preserving spatial coherence. This structural divergence causes existing one-size-fits-all compression methods to perform poorly on VAR models. To address this, we propose HACK, a training-free Head-Aware KV cache Compression frameworK. HACK utilizes an offline classification scheme to separate head types, enabling it to apply pattern-specific compression strategies with asymmetric cache budgets for each category. By doing so, HACK effectively constrains the average KV cache length within a fixed budget $B$, reducing the theoretical attention complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(Bn^2)$. Extensive experiments on multiple VAR models across text-to-image and class-conditional tasks validate the effectiveness and generalizability of HACK. It achieves up to 70% KV cache compression without degrading output quality, resulting in memory savings and faster inference. For example, HACK provides a $1.75\times$ memory reduction and a $1.57\times$ speedup on Infinity-8B.
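
To make the complexity claim concrete, the following minimal sketch counts query-key pairs under next-scale prediction, with and without a cap on the KV cache length. It is an illustration under stated assumptions, not the paper's implementation: the function name `attention_pairs`, the scale schedule, and the simplification that every query attends to at most `budget` keys are all hypothetical, and $n$ is read here as the side length of the largest token map.

```python
def attention_pairs(side_lengths, budget=None):
    """Count query-key pairs summed over all scales.

    side_lengths: side length of the token map at each scale (assumed schedule).
    budget: optional cap on the number of keys each query attends to
            (a simplification of a fixed average cache budget B).
    """
    total = 0
    cached = 0  # tokens from earlier scales held in the KV cache
    for side in side_lengths:
        queries = side * side        # tokens generated in parallel at this scale
        keys = cached + queries      # earlier scales plus the current scale
        if budget is not None:
            keys = min(keys, budget)  # compressed cache never exceeds the budget
        total += queries * keys
        cached += queries
    return total


# Hypothetical three-scale schedule with token maps of side 1, 2, and 4.
full = attention_pairs([1, 2, 4])
capped = attention_pairs([1, 2, 4], budget=8)
print(full, capped)
```

Without a cap, the last scale alone contributes $n^2$ queries times on the order of $n^2$ keys, which is the $\mathcal{O}(n^4)$ term; with a cap of $B$ keys the same scale contributes at most $Bn^2$.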

Paper Structure

This paper contains 28 sections, 17 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: (a) Vanilla VAR models cache all KV pairs across different scales. (b) HACK only preserves proper KV pairs selected by head-aware strategies, effectively hacking down both attention complexity and KV cache length.
  • Figure 2: Attention Patterns of Contextual and Structural Heads. Both Contextual and Structural heads exhibit consistent vertical and multi-diagonal patterns, respectively, across different samples and scales.
  • Figure 3: Impact of selective head masking (10% of total heads for each type). Compared with the original generation (a), masking contextual heads (b) leads to semantic divergence while maintaining high visual fidelity; in contrast, masking structural heads (c) preserves global content direction but results in severe structural degradation.
  • Figure 4: Empirical analysis on VAR-d30. (Left) Compression sensitivity of contextual vs. structural heads. (Right) Comparison of two scale-preserving strategies. Experiments are conducted on 8K ImageNet samples.
  • Figure 5: Overview of our proposed HACK framework. It consists of (a) offline head classification via attention variance, (b) asymmetric head budget allocation based on compression sensitivity, and (c) pattern-specific KV cache compression strategies.
  • ...and 13 more figures
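
The captions for Figures 2 and 5 mention vertical versus multi-diagonal attention patterns and an offline classification "via attention variance", but this excerpt does not give the statistic itself. The sketch below uses a stand-in statistic chosen purely for illustration: the variance, across keys, of the attention mass each key receives on average. A vertical pattern concentrates mass on a few keys (high variance), while a diagonal pattern spreads it evenly (low variance). The function names and the threshold are hypothetical and should not be read as the paper's method.

```python
from statistics import pvariance


def column_mass_variance(attn):
    """Variance across keys of the mean attention each key receives.

    attn: list of rows, one per query, each row a distribution over keys.
    This is an assumed stand-in statistic, not the one defined in the paper.
    """
    num_queries = len(attn)
    num_keys = len(attn[0])
    column_means = [
        sum(row[k] for row in attn) / num_queries for k in range(num_keys)
    ]
    return pvariance(column_means)


def classify_head(attn, threshold):
    """Label a head by comparing the statistic to a hypothetical threshold."""
    score = column_mass_variance(attn)
    return "contextual" if score > threshold else "structural"


# Toy maps: every query attends to key 0 (vertical) vs. to its own position (diagonal).
vertical = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
diagonal = [[0.7 if i == j else 0.1 for j in range(4)] for i in range(4)]
print(classify_head(vertical, threshold=0.01))
print(classify_head(diagonal, threshold=0.01))
```

In practice such a statistic would be averaged over many calibration samples and scales before thresholding, since the classification is described as offline.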