SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

Théophane Vallaeys, Jakob Verbeek, Matthieu Cord

TL;DR

SSDD tackles the inefficiencies of diffusion-based image tokenizers by introducing a GAN-free, single-step diffusion decoder designed for pixel-space tokenization. It blends a scalable U-Net–transformer architecture with flow matching, LPIPS perceptual alignment, and REPA regularization, followed by a single-step distillation that preserves multi-step diffusion behavior in a fast decoder. Across ImageNet-1k experiments, SSDD achieves state-of-the-art reconstruction (e.g., rFID improvements from $0.87$ to $0.50$) while delivering substantial speedups (up to $3.8\times$ faster sampling in downstream DiT generation) and maintaining high generation quality. The approach supports a shared encoder, enables multi-resolution pretraining, and offers a drop-in replacement for KL-VAE, paving the way for faster, higher-fidelity generative models at scale.

Abstract

Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses and incurs a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves the generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.

Paper Structure

This paper contains 17 sections, 1 equation, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Left: Speed-quality Pareto-front for different state-of-the-art f8c4 feedforward and diffusion autoencoders. Right: Reconstructions of KL-VAE and SSDD models with similar throughput. Bottom: High-level overview of our method.
  • Figure 2: Training of the SSDD tokenizer. The input image $x_0$ is mapped to latents $z$ by the (possibly frozen) encoder $E$. Noise $\epsilon\sim \mathcal{N}(0,1)$ is sampled and added to $x_0$ to form the noisy input $x_t$. The decoder $D$ learns to denoise $x_t$ conditioned on $z$ (input + AdaNorm) and $t$ (AdaNorm). The model is trained with flow-matching (generative), REPA (feature alignment) and LPIPS (perceptual) losses.
  • Figure 3: Evolution of rFID and qualitative comparison when increasing spatial downsampling. Evaluated on ImageNet $256\!\times\!256$ with a constant compression ratio by adjusting $c$.
  • Figure S1: Evolution of reconstruction metrics depending on the number of sampling steps $N$. Evaluated on ImageNet $256\!\times\!256$.
  • Figure S2: Effect of sampling on Density and coverage.
  • ...and 5 more figures
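
The training setup described in the Figure 2 caption (noise $\epsilon$ mixed into $x_0$ to form $x_t$, a decoder conditioned on $z$ and $t$, and a flow-matching loss) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes the common linear flow-matching path $x_t = (1-t)\,x_0 + t\,\epsilon$ with velocity target $\epsilon - x_0$, which this page does not spell out. The function names and the `velocity_fn(x, t, z)` interface are hypothetical, and the REPA and LPIPS terms are omitted.

```python
import numpy as np

def make_noisy_input(x0, eps, t):
    """Assumed linear path: t=0 gives the clean image x0, t=1 gives pure noise eps."""
    t = np.asarray(t, dtype=x0.dtype).reshape(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return (1.0 - t) * x0 + t * eps

def velocity_target(x0, eps):
    """Velocity of the assumed linear path, d x_t / d t = eps - x0."""
    return eps - x0

def flow_matching_loss(pred_v, x0, eps):
    """Mean squared error between the predicted and target velocity."""
    return float(np.mean((pred_v - velocity_target(x0, eps)) ** 2))

def euler_decode(velocity_fn, z, shape, num_steps, rng):
    """Integrate from t=1 (noise) down to t=0 (image) with num_steps Euler steps.

    velocity_fn(x, t, z) is a hypothetical stand-in for the decoder D.
    num_steps=1 corresponds to single-step decoding.
    """
    x = rng.standard_normal(shape)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t_cur, z)
        x = x + (t_next - t_cur) * v
    return x
```

With a multi-step decoder, `num_steps` trades decoding time for quality. The distillation step mentioned in the abstract aims to make the `num_steps=1` case match the multi-step result, which is why the sampler takes the step count as a parameter.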