TensorAR: Refinement is All You Need in Autoregressive Image Generation

Cheng Cheng, Lin Song, Yicheng Xiao, Yuxin Chen, Xuchong Zhang, Hongbin Sun, Ying Shan

TL;DR

TensorAR redefines autoregressive image generation by predicting overlapping tensors (next-tensor prediction) rather than single tokens, enabling iterative refinement of previously generated content. A discrete tensor noising scheme prevents information leakage during training, and a plug-and-play input encoder and output decoder allow integration with existing AR models. Empirical results on Open-MAGVIT2, RAR, and LlamaGEN show consistent FID improvements across model sizes, approaching diffusion-quality fidelity with modest overhead. The approach offers a practical, architecture-agnostic way to enhance AR-based image synthesis while preserving compatibility with multimodal LLMs, with potential extensions to text-conditioned generation.

Abstract

Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.
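
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows (1) how a token sequence can be cut into overlapping sliding windows, so that each token is predicted more than once, and (2) one possible form of discrete noising, in which input tokens are swapped for random codebook indices. The function names, the window length `k`, the linear noise schedule, and the choice to noise later slots more heavily are all assumptions made for illustration; the abstract only states that noise is codebook-indexed and is applied to input tokens to prevent information leakage.

```python
import random


def sliding_tensors(tokens, k):
    """Return overlapping windows of k consecutive tokens, one per start index.

    Window i covers tokens[i : i + k]. Consecutive windows share k - 1 tokens,
    so a token away from the sequence ends appears in k different windows.
    """
    if k < 1 or k > len(tokens):
        raise ValueError("k must be between 1 and len(tokens)")
    return [tokens[i:i + k] for i in range(len(tokens) - k + 1)]


def noise_tensor(window, codebook_size, max_rate, rng):
    """Randomly replace tokens in a window with random codebook indices.

    ASSUMPTION: the replacement probability grows linearly from 0 at the first
    slot to max_rate at the last slot. The real schedule may differ.
    """
    k = len(window)
    noised = []
    for j, tok in enumerate(window):
        rate = max_rate * j / (k - 1) if k > 1 else 0.0
        if rng.random() < rate:
            noised.append(rng.randrange(codebook_size))  # codebook-indexed noise
        else:
            noised.append(tok)
    return noised


# Toy example: 6 tokens from a hypothetical codebook of size 8, window length 3.
tokens = [5, 1, 7, 3, 2, 6]
windows = sliding_tensors(tokens, k=3)

rng = random.Random(0)  # seeded only so the sketch is repeatable
noised = [noise_tensor(w, codebook_size=8, max_rate=0.5, rng=rng) for w in windows]

for clean, noisy in zip(windows, noised):
    print(clean, "->", noisy)
```

With `k=3`, the six tokens yield four windows, and interior tokens such as `7` appear in three of them. This repeated prediction is what gives a model the chance to revise earlier outputs. Under the assumed schedule the first slot of every window is never noised, while later slots are replaced with probability up to `max_rate`.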

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: Comparison of the inference scheme of different autoregressive image generators.
  • Figure 2: Illustration of the proposed discrete tensor noising scheme for training. Darker green denotes higher noise intensity.
  • Figure 3: Generated samples.
  • Figure 4: FID curves across different model sizes.
  • Figure 5: Speed/accuracy trade-off.
  • ...and 2 more figures