Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj

TL;DR

The paper investigates what makes a latent space effective for diffusion-based image synthesis and argues that a discriminative latent structure with few Gaussian mixture model (GMM) modes matters more than variational regularization. It introduces MAETok, a masked autoencoder tokenizer that learns semantically rich latent representations while maintaining reconstruction fidelity, using mask modeling and auxiliary decoders that predict multiple targets. On ImageNet, MAETok enables state-of-the-art generation with only 128 latent tokens, reaching 1.69 gFID at 512x512 with 76x faster training and 31x higher inference throughput, while theoretical and empirical analyses link fewer latent-space modes to better diffusion performance. The work highlights latent-space structure as a key driver of diffusion efficiency and quality, offering a scalable path to high-resolution generation with reduced computational overhead.

Abstract

Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that support better learning and generation in diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) that leverages mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary and that a discriminative latent space from an AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models are released.

Paper Structure

This paper contains 29 sections, 4 theorems, 26 equations, 22 figures, 13 tables.

Key Result

Theorem 2.1

(Informal, see app-theorem:2.2) Let the data distribution be a mixture of $K$ Gaussians as defined in eq:main-gmm. Then assume the norm of each mode is bounded by some constants, let $d$ be the data dimension, $T$ be the total time steps, and $\epsilon$ be a proper target error parameter. In order t where the upper bound of the mean norm satisfies $\max_i \|\boldsymbol{\mu}_i\| \leq B$.
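
For orientation, a mixture of $K$ Gaussians with bounded means is conventionally written as below. This is the generic textbook form, not a reproduction of eq:main-gmm: the mixture weights $\pi_i$ and covariances $\boldsymbol{\Sigma}_i$ are assumptions introduced here, and the paper's definition may fix them to simpler choices.

$$
p(\mathbf{x}) = \sum_{i=1}^{K} \pi_i \, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_i,\, \boldsymbol{\Sigma}_i\right),
\qquad \pi_i \ge 0,\quad \sum_{i=1}^{K} \pi_i = 1,
\qquad \max_i \|\boldsymbol{\mu}_i\| \le B .
$$

In this notation, "fewer modes" corresponds to a smaller $K$.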

Figures (22)

  • Figure 1: Diffusion models with MAETok achieve state-of-the-art image generation on ImageNet at 512$\times$512 and 256$\times$256 resolution.
  • Figure 2: GMM fitting on the latent space of AE, VAE, VAVAE, and MAETok. Fewer GMM modes in the latent space usually correspond to lower diffusion losses and better generation performance.
  • Figure 3: Model architecture of MAETok. We adopt a plain 1D autoencoder (AE) as the tokenizer, with a vision transformer (ViT) encoder $\mathcal{E}$ and decoder $\mathcal{D}$. MAETok is trained using mask modeling at the encoder, with a mask ratio of 40-60%, and predicts multiple target features (e.g., HOG, DINO-v2, and CLIP features) of the masked tokens from the unmasked ones using auxiliary shallow decoders.
  • Figure 4: UMAP visualization on ImageNet of the learned latent space from (a) AE; (b) VAE; (c) MAETok. Colors indicate different classes. MAETok presents a more discriminative latent space.
  • Figure 5: The latent space from the tokenizer correlates strongly with generation performance. A more discriminative latent space, with higher linear probing (LP) accuracy, (a) usually leads to better gFID and (b) makes the learning of the diffusion model easier and faster.
  • ...and 17 more figures
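
The masking step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption only states a 40-60% mask ratio, so drawing the ratio uniformly per image, the token count of 256, and the helper name `sample_token_mask` are all assumptions made for the example.

```python
import random


def sample_token_mask(num_tokens, min_ratio=0.4, max_ratio=0.6, rng=None):
    """Split token indices into masked and visible sets (hypothetical helper).

    The mask ratio is drawn uniformly from [min_ratio, max_ratio]; the
    uniform choice is an assumption, the source only gives the 40-60% range.
    """
    rng = rng or random.Random()
    ratio = rng.uniform(min_ratio, max_ratio)
    num_masked = round(ratio * num_tokens)

    # Shuffle all indices, then take the first `num_masked` as the masked set.
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return masked, visible


# Example: 256 tokens is an assumed count, chosen only for illustration.
masked, visible = sample_token_mask(256, rng=random.Random(0))
print(f"masked {len(masked)} of {len(masked) + len(visible)} tokens")
```

In a setup like the one the caption describes, the encoder would see only the `visible` tokens, and the auxiliary decoders would be trained to predict target features at the `masked` positions.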

Theorems & Definitions (5)

  • Theorem 2.1
  • Remark 1.4
  • Theorem 1.5
  • Theorem 1.6
  • Theorem 1.7