Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

TL;DR

The paper investigates using frozen vision foundation models as image tokenizers for autoregressive image generation. It introduces VFMTok, a region-adaptive tokenizer that samples semantically coherent, irregular regions via deformable attention, and jointly reconstructs both the original image and the foundation-model features to preserve semantics. By combining a frozen VFM encoder, region-adaptive tokens, and a feature-alignment objective, VFMTok achieves high-quality reconstruction and generation with substantially fewer tokens (256 vs ~576), accelerates AR convergence (up to 3×), and delivers CFG-free, high-fidelity class-conditional synthesis, including a gFID of 1.36 on ImageNet with advanced AR models. This approach highlights the potential of VFMs as powerful, efficient priors for tokenization, enabling scalable, high-quality image synthesis and opening avenues toward unified visual generation and understanding.

Abstract

In this work, we present a novel direction: building an image tokenizer directly on top of a frozen vision foundation model, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of 1.36 on ImageNet benchmarks while accelerating model convergence by three times and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.

Paper Structure

This paper contains 26 sections, 2 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: VFMTok introduces novel features, including: (a) region-adaptive quantization, where it adaptively samples regions of similar patterns and extracts their VFM features for quantization; (b) faster convergence than the vanilla VQGAN (LlamaGen) for AR image synthesis.
  • Figure 2: The framework of VFMTok. VFMTok utilizes a frozen VFM to extract multi-level image features. A deformable Transformer then processes these features with learnable grid queries to generate region-adaptive tokens. After quantization, these tokens are fed into a shared ViT for dual reconstruction: 1) VFM features, targeting similarity with the VFM's last-layer outputs, and 2) image latent features, which are reshaped to a 2D grid and decoded into pixels.
  • Figure 3: Class-conditional image generation with CFG.
  • Figure 4: Class-conditional image generation without CFG.
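
The quantization and feature-reconstruction steps named in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a standard nearest-neighbor codebook lookup for quantization and assumes the "similarity" target is cosine similarity, neither of which is confirmed by the text above. The function names (`quantize`, `semantic_loss`) and the tiny 2-D codebook are hypothetical and chosen only for illustration.

```python
import math

def quantize(token, codebook):
    """Return (index, code) of the codebook entry nearest to `token`.

    Assumption: plain squared-L2 nearest-neighbor lookup, as in generic
    vector quantization. The paper's exact quantizer may differ.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(token, codebook[i]))
    return index, codebook[index]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_loss(reconstructed, targets):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Assumption: `targets` stand in for the frozen VFM's last-layer
    features, and cosine similarity is the similarity measure.
    """
    sims = [cosine_similarity(r, t) for r, t in zip(reconstructed, targets)]
    return 1.0 - sum(sims) / len(sims)

# Toy data: three 2-D codes and two "region-adaptive" tokens (hypothetical).
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tokens = [[0.9, 0.1], [-0.8, 0.2]]

# Each token is replaced by the index of its nearest code.
indices = [quantize(t, codebook)[0] for t in tokens]
quantized = [codebook[i] for i in indices]

# Loss is 0 when reconstructed features point in the same direction
# as the target features, and grows as they diverge.
loss_aligned = semantic_loss([[1.0, 0.0]], [[2.0, 0.0]])
loss_orthogonal = semantic_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

In this sketch the discrete `indices` are what an AR model would be trained to predict, while the loss term penalizes reconstructed features whose direction drifts from the frozen model's features. In a real system the features would be high-dimensional tensors and the loss would be combined with a pixel reconstruction term.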