Diffusion Autoencoders are Scalable Image Tokenizers

Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra

TL;DR

Diffusion Tokenizers (DiTo) are diffusion autoencoders that offer a simple, self-supervised approach to learning compact image representations: an encoder and a diffusion-based decoder are trained jointly with a single ELBO-aligned diffusion objective. By replacing complex, multi-term losses with a Flow Matching loss and introducing noise-synchronization regularization, DiTo achieves competitive or superior image reconstruction and downstream generation compared to the supervised GAN-LPIPS tokenizer (GLPTo), especially at scale. The work grounds the objective in the ELBO, demonstrates scalability across model sizes, and shows that jointly learning the latent representation with a diffusion decoder yields robust, high-quality reconstructions and generation while remaining fully self-supervised. Overall, DiTo offers a simpler, scalable alternative for image tokenization, with strong empirical results and clear avenues for extension to higher resolutions and other modalities.

Abstract

Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer, which is supervised. DiTo achieves competitive or better quality than the state of the art in image reconstruction and downstream image generation tasks.
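As a rough illustration of what a single diffusion L2 objective can look like, the sketch below computes a Flow Matching loss for one flattened toy image. It assumes the common linear (rectified-flow) noising path $\bm{x}_t = (1 - t)\,\bm{x} + t\,\bm{\epsilon}$ with velocity target $\bm{\epsilon} - \bm{x}$; the paper's exact parameterization may differ. The names `toy_encoder`, `toy_decoder`, and `tokenizer_loss` are hypothetical stand-ins for $E$, $D$, and the training loss, not the authors' implementation, and real models would be neural networks trained by gradient descent.

```python
import random


def flow_matching_pair(x, eps, t):
    """Return the noisy sample x_t and the velocity target for one example.

    Assumption: linear path x_t = (1 - t) * x + t * eps, whose time
    derivative (the regression target) is eps - x.
    """
    x_t = [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]
    v_target = [ei - xi for xi, ei in zip(x, eps)]
    return x_t, v_target


def toy_encoder(x):
    # Hypothetical stand-in for E: compress the image to one scalar "token".
    return sum(x) / len(x)


def toy_decoder(x_t, t, z):
    # Hypothetical stand-in for D(x_t, t, z): treat z as the clean-pixel
    # estimate and return the velocity implied by the linear path.
    return [(xi - z) / t for xi in x_t]


def tokenizer_loss(x, eps, t):
    """Single L2 loss between predicted and target velocity."""
    z = toy_encoder(x)
    x_t, v_target = flow_matching_pair(x, eps, t)
    v_pred = toy_decoder(x_t, t, z)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x)


if __name__ == "__main__":
    image = [0.0, 1.0, 0.0, 1.0]                  # toy 4-pixel "image"
    noise = [random.gauss(0.0, 1.0) for _ in image]
    t = random.uniform(0.05, 1.0)                 # keep t away from 0
    print(tokenizer_loss(image, noise, t))
```

In this toy setup a constant image is perfectly described by its mean, so the loss is zero up to rounding, while any image with spatial variation incurs a positive loss. That mirrors, in a very simplified way, how the decoder's loss depends on how much information about $\bm{x}$ the tokens $\bm{z}$ carry.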

Paper Structure

This paper contains 44 sections, 15 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Diffusion tokenizer (DiTo) is a diffusion autoencoder with an ELBO objective (e.g., Flow Matching). The input image $\bm{x}$ is passed into the encoder $E$ to obtain the latent representation, i.e., 'tokens' $\bm{z}$; a decoder $D$ then learns the distribution $p(\bm{x}|\bm{z})$ with the diffusion objective. $E$ and $D$ are jointly trained from scratch. In contrast, prior work (a) relies on a combination of losses, heuristics, and pretrained models to learn.
  • Figure 2: Comparison of GAN-LPIPS tokenizer (GLPTo) and diffusion tokenizer (DiTo). GLPTo uses a weighted combination of L1, LPIPS, and GAN loss, while DiTo only uses a diffusion L2 loss. Despite this simplicity, we observe that when scaled up, DiTo is competitive with or better than GLPTo for reconstruction, as shown in the examples (at 256-pixel resolution).
  • Figure 3: Scalability of diffusion tokenizers. When increasing the number of trainable parameters in the diffusion decoder from DiTo-B to DiTo-L to DiTo-XL in the joint training, we observe that the image reconstruction quality keeps improving for structures and textures. Both the visual quality and reconstruction faithfulness are improved when scaling up the diffusion tokenizer.
  • Figure 4: Human preference comparison of image reconstructions. Models are compared to GLPTo at the same scale. When scaled up, the visual quality of DiTo (trained without a perceptual loss) improves significantly and outperforms GLPTo in human preference.
  • Figure 5: Comparison of training objectives in diffusion tokenizers. The frozen $\bm{z}$ space is from a GLPTo-B. We observe that when jointly training the encoder and diffusion decoder, ELBO diffusion objectives (flow matching, $\bm{v}$-pred with cosine schedule) can learn a good latent representation $\bm{z}$, while other objectives may exhibit a color shift in the reconstruction (colors are good given a frozen $\bm{z}$ space).
  • ...and 6 more figures