Effective and Efficient Masked Image Generation Models

Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li

TL;DR

This work unifies masked image generation and masked diffusion models into a single, coherent framework (eMIGM) and systematically investigates training and sampling design choices to maximize efficiency and quality. It introduces a time-interval classifier-free guidance strategy and adopts a diffusion-based conditional component to mitigate tokenization losses, enabling high-quality ImageNet generation with far fewer function evaluations. Empirical results on ImageNet at 256×256 and 512×512 demonstrate that eMIGM can surpass or closely match state-of-the-art diffusion models while requiring substantially fewer NFEs, with performance improving as model size scales. The study provides practical default settings, demonstrates the benefits of model scaling for both training and sampling, and supplies code for reproducibility.

Abstract

Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as the NFEs and model parameters increase, eMIGM achieves performance comparable to state-of-the-art continuous diffusion models while requiring less than 40% of the NFEs. Additionally, on ImageNet 512x512, with only about 60% of the NFEs, eMIGM outperforms state-of-the-art continuous diffusion models. Code is available at https://github.com/ML-GSAI/eMIGM.

Paper Structure

This paper contains 21 sections, 11 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Generated samples from eMIGM trained on ImageNet $512\times512$.
  • Figure 2: Exploring the design space of training. Orange solid lines indicate the preferred choices in each subfigure.
  • Figure 3: Exploring the design space of sampling. For each plot, points from left to right correspond to an increasing number of mask prediction steps: 8, 16, 32, and up to 256. In each subfigure, DPM-Solver is denoted as DPMS. (a) The exp schedule outperforms others by predicting fewer tokens early. (b) DPM-Solver performs better with fewer prediction steps. (c) The time interval maintains performance while reducing sampling cost for each mask prediction step, particularly for high mask prediction steps.
  • Figure 4: Scalability of eMIGM. (a) A negative correlation demonstrates that eMIGM benefits from scaling. (b) Larger models are more training-efficient (i.e., achieving better sample quality with the same training FLOPs). (c) Larger models are more sampling-efficient (i.e., achieving better sample quality with the same inference time).
  • Figure 5: Different choices of mask schedules. Left: $\gamma_t$ (i.e., the probability that each token is masked during the forward process). Right: Weight of the loss in MDM.
  • ...and 3 more figures
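
The Figure 5 caption describes a mask schedule $\gamma_t$ (the probability that each token is masked at time $t$) and a corresponding loss weight in MDM. As a rough illustration only, the sketch below defines three schedules and a loss weight. The functional forms of `gamma_cosine` and `gamma_exp`, the rate `k=5.0`, and all function names are assumptions made for this example and are not taken from the paper. The weight $\gamma'_t / \gamma_t$ is the one that appears in the standard continuous-time masked diffusion objective, and may differ from the exact weighting the authors plot.

```python
import math
import random


def gamma_linear(t):
    """Linear schedule: masking probability grows proportionally with t."""
    return t


def gamma_cosine(t):
    """Cosine schedule (assumed form): 0 at t=0, 1 at t=1."""
    return math.cos(0.5 * math.pi * (1.0 - t))


def gamma_exp(t, k=5.0):
    """Exponential schedule (assumed form and rate k), normalized so gamma(1)=1.

    It is nearly flat close to t=1, so a sampler moving from t=1 toward t=0
    unmasks only a few tokens in its first steps.
    """
    return (1.0 - math.exp(-k * t)) / (1.0 - math.exp(-k))


def mdm_loss_weight(gamma, t, eps=1e-5):
    """Loss weight gamma'(t) / gamma(t), with a central finite difference."""
    dgamma = (gamma(t + eps) - gamma(t - eps)) / (2.0 * eps)
    return dgamma / gamma(t)


def mask_tokens(tokens, gamma_t, rng, mask_id=-1):
    """Forward process: mask each token independently with probability gamma_t."""
    return [mask_id if rng.random() < gamma_t else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(16))  # stand-in for image token indices
    for name, gamma in [("linear", gamma_linear),
                        ("cosine", gamma_cosine),
                        ("exp", gamma_exp)]:
        t = 0.5
        masked = mask_tokens(tokens, gamma(t), rng)
        print(name, round(gamma(t), 3), round(mdm_loss_weight(gamma, t), 3),
              sum(tok == -1 for tok in masked))
```

For the linear schedule the weight reduces to $1/t$, which is why low-noise timesteps dominate the loss under that choice.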