SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
Théophane Vallaeys, Jakob Verbeek, Matthieu Cord
TL;DR
SSDD tackles the inefficiencies of diffusion-based image tokenizers by introducing a GAN-free, single-step diffusion decoder that operates in pixel space. It combines a scalable U-Net–transformer architecture with flow matching, an LPIPS perceptual loss, and REPA regularization, then distills the resulting multi-step diffusion decoder into a fast single-step decoder that preserves its behavior. In ImageNet-1k experiments, SSDD achieves state-of-the-art reconstruction (rFID improves from $0.87$ to $0.50$) while delivering substantial speedups (up to $3.8\times$ faster sampling in downstream DiT generation) and maintaining generation quality. The approach supports a shared encoder, enables multi-resolution pretraining, and can serve as a drop-in replacement for KL-VAE when building faster, higher-fidelity generative models.
Abstract
Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses and incurs a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction and trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves the generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.
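The abstract does not spell out the training objective or the sampler, so the following is only a rough illustration of why iterative decoding costs more than single-step decoding, and of the idealized case in which the two agree. It assumes a linear (rectified-flow style) interpolation path between a clean image and noise, with $t=1$ as pure noise; that convention, along with the names `interpolate`, `target_velocity`, `euler_decode`, and `oracle_velocity`, is hypothetical and not taken from the paper. A real decoder would replace the oracle with a learned network conditioned on the latent.

```python
import numpy as np


def interpolate(x0, noise, t):
    """Assumed linear path: x_t = (1 - t) * x0 + t * noise."""
    return (1.0 - t) * x0 + t * noise


def target_velocity(x0, noise):
    """Time derivative of the assumed linear path (the regression target)."""
    return noise - x0


def euler_decode(velocity_fn, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 with Euler steps.

    Each step costs one call to velocity_fn, so decoding cost grows
    linearly with num_steps.
    """
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x


# Toy setup: a hypothetical oracle that already knows the clean image.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))      # stand-in for a clean image
noise = rng.normal(size=(4, 4))   # stand-in for the initial noise sample


def oracle_velocity(x, t):
    # Constant along the straight path, which is the idealized case.
    return target_velocity(x0, noise)


multi_step = euler_decode(oracle_velocity, noise, num_steps=8)
single_step = euler_decode(oracle_velocity, noise, num_steps=1)
```

In this idealized setting the velocity is constant along the path, so one Euler step lands on the same point as eight. A learned velocity field is generally not constant, which is the gap that a distillation stage such as the one described above is meant to close.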
