REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

Qiyuan He, Yicong Li, Haotian Ye, Jinghao Wang, Xinyao Liao, Pheng-Ann Heng, Stefano Ermon, James Zou, Angela Yao

TL;DR

This work identifies generator-tokenizer inconsistency as a core bottleneck in visual autoregressive generation and traces it to two causes: exposure bias and embedding unawareness. It proposes reAR, a plug-and-play token-wise regularization framework that trains the autoregressive model’s hidden states to recover the tokenizer’s embedding of the current token and to predict the embedding of the target token under a noisy context. On ImageNet 256×256, reAR reduces gFID from 3.02 to 1.86 and raises IS to 316.9 with a raster-based tokenizer, and reaches a gFID of 1.42 with 177M parameters using advanced tokenizers, rivaling diffusion models. The approach generalizes across tokenizers, scales with model size, and preserves training efficiency, pointing to a practical path toward tokenizer-friendly visual AR and unified multi-modal generation.

Abstract

Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance still lags behind that of diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be decoded well by the tokenizer. To address this, we propose reAR, a simple training strategy that introduces a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and to predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, or inference pipeline, and no external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M parameters).
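The token-wise objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the uniform token-replacement noise, the mean-squared-error form of the two embedding terms, the linear heads `w_cur` and `w_next`, the weight `lam`, and the use of clean tokens as regression targets are all assumptions made for illustration, and random arrays stand in for the causal transformer's outputs.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def corrupt_context(tokens, vocab_size, noise_prob, rng):
    # ASSUMPTION: the "noisy context" is modeled as replacing each input
    # token with a uniformly random token with probability noise_prob.
    mask = rng.random(tokens.shape) < noise_prob
    random_tokens = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(mask, random_tokens, tokens)

def regularized_loss(logits, hidden, w_cur, w_next, codebook,
                     cur_tokens, next_tokens, lam=1.0):
    # logits: (T, V) next-token scores; hidden: (T, D) hidden states.
    # codebook: (V, E) tokenizer embeddings; w_cur, w_next: (D, E)
    # hypothetical linear heads. cur_tokens / next_tokens: (T,) clean ids.
    logp = log_softmax(logits)
    ce = -logp[np.arange(len(next_tokens)), next_tokens].mean()
    # Recover the embedding of the current (clean) token.
    recover = ((hidden @ w_cur - codebook[cur_tokens]) ** 2).mean()
    # Predict the embedding of the target (next) token.
    predict = ((hidden @ w_next - codebook[next_tokens]) ** 2).mean()
    return ce + lam * (recover + predict), ce, recover, predict

# Toy usage with random stand-ins for the model's outputs.
rng = np.random.default_rng(0)
T, V, D, E = 6, 16, 8, 4
sequence = rng.integers(0, V, size=T + 1)
cur_tokens, next_tokens = sequence[:-1], sequence[1:]
noisy_inputs = corrupt_context(cur_tokens, V, noise_prob=0.3, rng=rng)
# A real model would compute these from noisy_inputs; here they are random.
logits = rng.normal(size=(T, V))
hidden = rng.normal(size=(T, D))
codebook = rng.normal(size=(V, E))
w_cur = rng.normal(size=(D, E))
w_next = rng.normal(size=(D, E))
total, ce, recover, predict = regularized_loss(
    logits, hidden, w_cur, w_next, codebook, cur_tokens, next_tokens, lam=0.5)
print(f"total={total:.3f} ce={ce:.3f} recover={recover:.3f} predict={predict:.3f}")
```

In this sketch the regularizer only adds terms to the training loss, which is consistent with the abstract's statement that the tokenizer, generation order, and inference pipeline stay unchanged.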

Paper Structure

This paper contains 25 sections, 11 equations, 23 figures, 5 tables.

Figures (23)

  • Figure 1: Generator-tokenizer inconsistency is the bottleneck in the visual autoregressive model.
  • Figure 2: Overview of reAR, a plug-and-play framework that is agnostic to the visual tokenizer.
  • Figure 3: Token sequences with the same correct token ratio ($\mathrm{CTR}$) under teacher forcing can be decoded into images of different quality. At the same $\mathrm{CTR}$: (a) an image decoded from an imperfect context is much less similar to the ground truth than one decoded from a perfect context; (b) replacing incorrect tokens with other incorrect tokens whose embeddings are closer to that of the correct token can yield an image more similar to the ground truth than the original prediction.
  • Figure 4: Scaling Effect of reAR. As model size increases, the FID at each training step decreases consistently.
  • Figure 5: Sampling Speed. Comparison of different methods on FID and throughput (images/sec).
  • ...and 18 more figures
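
The correct token ratio in the Figure 3 caption, and the substitution described in panel (b), can be sketched as follows. The function names, the exact-match definition of $\mathrm{CTR}$, and the use of L2 distance between codebook embeddings are assumptions for illustration; the caption does not specify them.

```python
import numpy as np

def correct_token_ratio(predicted, target):
    # ASSUMPTION: CTR is the fraction of positions where the predicted
    # token id equals the ground-truth token id.
    predicted = np.asarray(predicted)
    target = np.asarray(target)
    if predicted.shape != target.shape:
        raise ValueError("predicted and target must have the same shape")
    return float((predicted == target).mean())

def nearest_wrong_token(codebook, correct_token):
    # Index of the codebook entry closest (L2) to the correct token's
    # embedding, excluding the correct token itself.
    dists = np.linalg.norm(codebook - codebook[correct_token], axis=1)
    dists[correct_token] = np.inf
    return int(np.argmin(dists))

# Toy usage: swap each wrong prediction for the wrong token whose
# embedding is nearest to the correct one. CTR is unchanged by design.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # hypothetical tokenizer codebook
target = np.array([3, 1, 4, 1, 5])
predicted = np.array([3, 1, 0, 1, 2])
swapped = predicted.copy()
for i in np.flatnonzero(predicted != target):
    swapped[i] = nearest_wrong_token(codebook, target[i])
print(correct_token_ratio(predicted, target), correct_token_ratio(swapped, target))
```

Because the swapped sequence is still wrong at the same positions, its $\mathrm{CTR}$ matches the original prediction, which is the setting in which panel (b) compares decoded image quality.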