Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, Weiyao Lin

TL;DR

LineAR tackles the memory bottleneck in autoregressive image generation with a training-free, progressive KV cache compression scheme that treats the cache as a 2D raster-line structure. It preserves the essential initial anchors and the most recent lines while progressively evicting less informative tokens from the mid-region under inter-line attention guidance, keeping the cache within a fixed budget. Across six AR visual models and multiple tasks, LineAR achieves substantial memory reductions and throughput speedups while maintaining or improving generation quality. The method relies on two observations: visual dependencies are local, and attention is highly consistent between adjacent lines, which makes line-by-line cache compression safe without retraining. These results point to practical gains in deployment scalability for AR-based multimodal generation systems.

Abstract

Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient AR image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 on LlamaGen-XL and COCO FID from 23.85 to 22.86 on Janus-Pro-1B, while retaining only 1/6 of the KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 of the KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
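
The cache policy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the parameters `anchor_lines` and `recent_lines`, and the toy scoring function are all hypothetical, and a plain dictionary of per-token scores stands in for the inter-line attention guidance. The sketch assumes tokens are indexed in raster order, so token `i` belongs to line `i // line_width`. After each generated line, it keeps the anchor lines and the most recent lines whole, then fills the remaining budget with the highest-scoring mid-region tokens.

```python
def compress_line_cache(kept, scores, line_width, budget,
                        anchor_lines=1, recent_lines=2):
    """Return the cached token indices to keep once a line is finished.

    kept   : token indices currently in the cache (raster order)
    scores : dict mapping index -> score from the newest line
             (hypothetical stand-in for inter-line attention)
    budget : maximum number of tokens allowed in the cache
    """
    if len(kept) <= budget:
        return sorted(kept)

    last_line = max(kept) // line_width

    # Initial anchor region: the first few lines are always preserved.
    anchor = [i for i in kept if i // line_width < anchor_lines]
    anchor_set = set(anchor)

    # Recent region: the last `recent_lines` lines are always preserved.
    recent = [i for i in kept
              if i // line_width > last_line - recent_lines
              and i not in anchor_set]

    protected = anchor_set | set(recent)
    mid = [i for i in kept if i not in protected]

    room = budget - len(protected)
    if room < 0:
        raise ValueError("budget is smaller than the anchor and recent regions")

    # Mid-region: keep only the highest-scoring tokens that still fit.
    best_mid = sorted(mid, key=lambda i: scores.get(i, 0.0), reverse=True)[:room]
    return sorted(anchor + recent + best_mid)


def toy_scores(cache, line_width):
    """Made-up deterministic scores; a real system would use attention weights."""
    return {i: ((i * 7) % 5) + 1.0 / (1 + i % line_width) for i in cache}


def simulate_generation(num_lines, line_width, budget):
    """Generate line by line, compressing the cache after every line."""
    cache = []
    for line in range(num_lines):
        start = line * line_width
        cache.extend(range(start, start + line_width))  # newly generated line
        cache = compress_line_cache(
            cache, toy_scores(cache, line_width), line_width, budget)
    return cache


final_cache = simulate_generation(num_lines=6, line_width=4, budget=14)
print(final_cache)
```

Because compression runs once per line rather than once per token, the cache size stays bounded by the budget plus one line of new tokens at any point during decoding, which is the property the abstract attributes to managing the cache at the line level.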

Paper Structure

This paper contains 12 sections, 10 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: LineAR enables efficient autoregressive image generation, preserving only 1/8, 1/6, and 1/6 of the KV cache, achieving up to 2.13$\times$, 5.62$\times$, and 7.57$\times$ speedup on Lumina-mGPT, Janus-Pro, and LlamaGen models, with improved or comparable generation quality.
  • Figure 2: Visualization of attention patterns and allocation. The attention persistently shifts toward the conditional tokens, while visual attention gradually dilutes as decoding progresses.
  • Figure 3: Visualization of attention evolution in line generation. The current line generation relies on tokens from the recent region and initial anchor, resulting in mid-region cache redundancy.
  • Figure 4: (Left) Attention similarity between adjacent lines. (Right) Attention visualization for past generated regions across two adjacent lines. Inter-line attention shows high consistency.
  • Figure 5: Visual KV cache management from a 2D perspective. The visual KV cache is organized in a 2D view, where each horizontal line represents a generating stage and serves as a natural unit for cache management in LineAR.
  • ...and 4 more figures