
Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, Kai Zhang

TL;DR

The paper addresses the challenge of designing diffusion-friendly visual tokenizers by aligning a pretrained visual foundation encoder so that it serves as a tokenizer. It introduces a three-stage alignment (Latent Alignment, Perceptual Alignment with a semantic preservation loss, and Decoder Refinement), using DINOv2 as the default backbone to produce semantically grounded latent spaces. On ImageNet $256\times256$, the tokenizer accelerates diffusion convergence, reaching a gFID of $1.90$ at 64 epochs, and improves generation with and without classifier-free guidance (CFG); on LAION, a 2B-parameter text-to-image (T2I) model trained with the tokenizer outperforms FLUX VAE at the same number of training steps. The approach is simple and scalable, and it extends the semantic grounding paradigm to tokenizers, with potential applicability to larger resolutions and multi-modal settings.

Abstract

In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256$\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
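The three-stage schedule described in the abstract can be summarized as a small configuration sketch. This is only an illustration of which components are updated and which loss terms are active in each stage; the names `StageConfig`, `STAGES`, and `frozen_components` are hypothetical, and the concrete form of the reconstruction and semantic preservation losses is not given here, so they appear only as labels.

```python
from dataclasses import dataclass

# Component names follow the abstract: a pretrained encoder, an adapter, a decoder.
COMPONENTS = ("encoder", "adapter", "decoder")


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical description of one alignment stage."""
    name: str
    trainable: frozenset  # components whose weights are updated
    losses: tuple         # loss terms active in this stage (labels only)


STAGES = (
    # Stage 1: encoder frozen; adapter and decoder are trained.
    StageConfig("latent_alignment",
                frozenset({"adapter", "decoder"}),
                ("reconstruction",)),
    # Stage 2: all components are optimized jointly, with an additional
    # semantic preservation loss.
    StageConfig("perceptual_alignment",
                frozenset({"encoder", "adapter", "decoder"}),
                ("reconstruction", "semantic_preservation")),
    # Stage 3: only the decoder is refined.
    StageConfig("decoder_refinement",
                frozenset({"decoder"}),
                ("reconstruction",)),
)


def frozen_components(stage: StageConfig) -> tuple:
    """Return the components that stay fixed during the given stage."""
    return tuple(c for c in COMPONENTS if c not in stage.trainable)


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name,
              "| train:", sorted(stage.trainable),
              "| frozen:", list(frozen_components(stage)),
              "| losses:", list(stage.losses))
```

In an actual training loop, a table like this would typically drive which parameters receive gradient updates in each stage; that wiring is framework-specific and is left out of the sketch.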

Paper Structure

This paper contains 26 sections, 5 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: Regularization vs. Alignment.
  • Figure 2: Method Overview. Stage 1: Latent Alignment (top). The pretrained encoder is kept frozen, while the adapter and decoder are trained with reconstruction loss to align its output into a semantically grounded latent space for generation. Stage 2: Perceptual Alignment (bottom left). All components are optimized jointly to enrich the latent space with low-level details, while a semantic preservation loss ensures that it retains high-level semantics. Stage 3: Decoder Refinement (bottom right). Only the decoder is fine-tuned with reconstruction loss to enhance reconstruction fidelity.
  • Figure 3: Reconstruction vs. Semantic Preservation in Tokenizer Training. Left: reconstruction FID (rFID) across training steps. Right: linear probing accuracy across training steps. Linear probing accuracy is evaluated on the latent code (32 channels), except for "All Stages, Pre-Adapter" (1024 channels), which is reported only for reference. In this case, linear probing accuracy is measured on the feature before the adapter, using the same checkpoint as "Stage 2 w/ Semantic Preservation Loss".
  • Figure 4: Comparison of Sampling Steps, CFG Scales, and Convergence Speed. Evaluated on ImageNet 256$\times$256. Left: effect of sampling steps versus gFID at 80K training steps. Middle: effect of CFG scale versus gFID at 80K training steps with 30 sampling steps. Right: effect of training steps versus gFID with 30 sampling steps. QKNorm is enabled during extended training to ensure stability. All gFIDs in the left and right figures are reported using the best-searched CFG scale.
  • Figure 5: Qualitative Comparison on Text-to-Image Generation with FLUX VAE. Input text prompts are shown below the images and results (256$\times$256 resolution) are generated from generative models trained for 100K steps. Our method (bottom row) produces images with better coherence and prompt alignment compared to the one using FLUX VAE (top row).
  • ...and 16 more figures