Unified Auto-Encoding with Masked Diffusion

Philippe Hansen-Estruch, Sriram Vishwanath, Amy Zhang, Manan Tomar

TL;DR

Unified Masked Diffusion (UMD) introduces a simple, asymmetric ViT-based auto-encoder that unifies MAE-style masking and diffusion-based denoising within a single objective. The key idea is a no-noise reconstruction step at $t=0$ with higher masking, combined with standard noisy steps at $t \ge 1$, optimized via $\mathcal{L}(\theta)_{UMD} = r_{t=0} \cdot \mathcal{L}(\theta)_{t=0} + (1 - r_{t=0}) \cdot \mathcal{L}(\theta)_{t \ge 1}$, enabling strong representations and generation without extra encoders. Empirically, UMD matches MAE on linear probing, competes with DiT in class-conditional generation when finetuned, and achieves notable efficiency gains over prior diffusion-based methods, across pixel and latent diffusion variants. The work highlights a practical path toward a single model suitable for both robust representations and high-quality generation, with potential for learned corruption schedules and further speedups in latent-space settings.
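The weighted objective above can be illustrated with a small sketch. This is not the authors' implementation: the function names `umd_loss` and `sample_timesteps` are hypothetical, and reading $r_{t=0}$ as the probability of drawing the noise-free step during training is an assumption made here for illustration (the summary only states the weighted sum).

```python
import numpy as np

def umd_loss(loss_t0, loss_noisy, r_t0):
    """Mix the noise-free (t = 0) loss and the noisy (t >= 1) loss.

    Implements r_t0 * L_{t=0} + (1 - r_t0) * L_{t>=1} as written in the text.
    """
    if not 0.0 <= r_t0 <= 1.0:
        raise ValueError("r_t0 must lie in [0, 1]")
    return r_t0 * loss_t0 + (1.0 - r_t0) * loss_noisy

def sample_timesteps(batch_size, num_steps, r_t0, rng):
    """Assumed sampling scheme: pick t = 0 with probability r_t0,
    otherwise pick t uniformly from 1..num_steps."""
    is_t0 = rng.random(batch_size) < r_t0
    noisy_t = rng.integers(1, num_steps + 1, size=batch_size)
    return np.where(is_t0, 0, noisy_t)

rng = np.random.default_rng(0)
timesteps = sample_timesteps(batch_size=8, num_steps=1000, r_t0=0.25, rng=rng)
mixed = umd_loss(loss_t0=2.0, loss_noisy=4.0, r_t0=0.25)
```

Under this reading, averaging per-sample losses over timesteps drawn by `sample_timesteps` gives, in expectation, the same weighting that `umd_loss` applies explicitly.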

Abstract

At the core of both successful generative and self-supervised representation learning models there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high-masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves on the computational efficiency of prior diffusion-based methods in total training time. We release our code at https://github.com/philippe-eecs/small-vision.
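As a rough illustration of the combined corruption described in the abstract, the sketch below masks patches and, for $t \ge 1$, also applies the standard Gaussian forward diffusion $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. Everything specific here is an assumption and not taken from the paper: the function name `corrupt_patches`, the linear beta schedule, and the example mask ratios (0.75 at $t=0$, 0.5 otherwise).

```python
import numpy as np

def corrupt_patches(patches, t, alpha_bar, mask_ratio_t0, mask_ratio_noisy, rng):
    """Return the visible (possibly noised) patches and their indices.

    patches:   array of shape (num_patches, patch_dim), the clean image patches
    t:         timestep; 0 means the noise-free, high-masking step
    alpha_bar: cumulative products of (1 - beta) for timesteps 1..T
    """
    num_patches = patches.shape[0]
    if t == 0:
        # Noise-free step: keep pixel values, use the higher mask ratio.
        noised = patches.copy()
        mask_ratio = mask_ratio_t0
    else:
        # Standard forward diffusion at timestep t.
        a = alpha_bar[t - 1]
        eps = rng.standard_normal(patches.shape)
        noised = np.sqrt(a) * patches + np.sqrt(1.0 - a) * eps
        mask_ratio = mask_ratio_noisy
    # Randomly choose which patches stay visible to the encoder.
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    return noised[keep_idx], keep_idx

# Assumed linear beta schedule with T = 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 12))  # 16 toy patches of dimension 12
visible_t0, idx_t0 = corrupt_patches(patches, 0, alpha_bar, 0.75, 0.5, rng)
visible_t10, idx_t10 = corrupt_patches(patches, 10, alpha_bar, 0.75, 0.5, rng)
```

In both cases the reconstruction target would be the original clean patches; only the visible subset is passed to the encoder, which is what makes the asymmetric design cheaper than processing every token.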

Paper Structure

This paper contains 31 sections, 7 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Unified Masked Diffusion (UMD). UMD combines random masking with the fine-grained noise schedule used in diffusion. The target is to predict the original image. For generation, UMD finetunes the encoder and decoder on unmasked noised images + class labels.
  • Figure 2: Transfer learning comparison with baselines. We run 10-shot linear probing on MAE, DiT, and UMD for the $64 \times 64 \times 3$ versions. UMD performs competitively with MAE while outperforming other diffusion methods on the different transfer datasets.
  • Figure 3: Samples after finetuning. We compare class-conditioned samples generated from DiT, MAE, and UMD after finetuning. Although MAE reaches a low FID score, its samples look less coherent than those of DiT and UMD, as its low Inception Score also indicates.
  • Figure 4: FID and Linear Probing Results over Fine-Tuning. We fine-tune our baseline methods and UMD on labeled images for use in class-conditional generation. We report the 10k-FID/IS over gradient steps as well as the 100-shot linear probing performance of the representation layer. UMD remains competitive with DiT and MaskDiT in FID/IS performance and maintains its representation performance compared to MAE.
  • Figure 5: Latent Unified Masked Diffusion Samples. We pre-trained UMD on latent diffusion for 800 epochs on ImageNet and present selected samples after fine-tuning for 50 epochs. UMD achieves strong generations with a CFG scale of $s = 4.0$.
  • ...and 3 more figures