Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance

Jiwan Hur, Dong-Jae Lee, Gyojin Han, Jaehyun Choi, Yunho Jeon, Junmo Kim

TL;DR

Equipped with a parameter-efficient fine-tuning method and high-temperature sampling, MGMs with the proposed self-guidance achieve a superior quality-diversity trade-off, outperforming existing sampling methods for MGMs at lower training and sampling cost.

Abstract

Masked generative models (MGMs) have shown impressive generative ability while being an order of magnitude more efficient in sampling steps than continuous diffusion models. However, MGMs still underperform recent, well-developed continuous diffusion models of similar size in image synthesis, in terms of both the quality and the diversity of generated samples. A key factor in the performance of continuous diffusion models is their guidance methods, which enhance sample quality at the expense of diversity. In this paper, we extend these guidance methods to a generalized guidance formulation for MGMs and propose a self-guidance sampling method, which leads to better generation quality. The proposed approach leverages an auxiliary task for semantic smoothing in vector-quantized token space, analogous to Gaussian blur in continuous pixel space. Equipped with a parameter-efficient fine-tuning method and high-temperature sampling, MGMs with the proposed self-guidance achieve a superior quality-diversity trade-off, outperforming existing sampling methods for MGMs at lower training and sampling cost. Extensive experiments with various sampling hyperparameters confirm the effectiveness of the proposed self-guidance.
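
The abstract does not state the guidance formula, so the following is only a minimal sketch of how guidance and high-temperature sampling could interact on token logits. It assumes the extrapolation form common to guidance methods, in which the prediction is pushed away from a degraded (here, smoothed) prediction. The names `guided_logits`, `sample_tokens`, `scale`, and `temperature` are hypothetical, and the smoothed logits are a stand-in for whatever the auxiliary semantic-smoothing branch would produce; this is not the authors' implementation.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax over the codebook axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def guided_logits(logits, logits_smoothed, scale):
    """Extrapolate away from the smoothed prediction (assumed guidance form).

    scale = 0 returns the unguided log-probabilities; a larger scale
    sharpens the tokens that the unguided prediction favours over the
    smoothed one.
    """
    logp = log_softmax(logits)
    logp_smoothed = log_softmax(logits_smoothed)
    return logp + scale * (logp - logp_smoothed)


def sample_tokens(logits, temperature, rng):
    """Draw one token per position from softmax(logits / temperature).

    Uses the Gumbel-max trick; a higher temperature gives more diverse
    samples.
    """
    gumbel_noise = rng.gumbel(size=logits.shape)
    return np.argmax(logits / temperature + gumbel_noise, axis=-1)


# Toy usage: 4 token positions, a hypothetical codebook of 8 entries.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(4, 8))
toy_smoothed = 0.5 * toy_logits  # placeholder for a smoothed prediction
toy_tokens = sample_tokens(
    guided_logits(toy_logits, toy_smoothed, scale=1.0),
    temperature=2.0,
    rng=rng,
)
```

In this sketch each sampling step needs two predictions (unguided and smoothed), and the temperature is raised to recover diversity that guidance tends to reduce.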

Paper Structure

This paper contains 18 sections, 7 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Comparison of images sampled using 18-step MaskGIT \cite{maskgit} without (top) and with (bottom) the proposed self-guidance on ImageNet at 512$\times$512 (left) and 256$\times$256 (right) resolutions. Each image pair is sampled using the same random seed and sampling hyperparameters. The proposed self-guidance effectively improves the capabilities of masked generative models.
  • Figure 2: Visualization of the effect of guidance using spatial smoothing (SAG) \cite{sag} and the proposed semantic smoothing. We tokenize the input image using the VQGAN \cite{esser2021taming} encoder, mask 90% of the VQ tokens, and predict $\hat{{\bm{x}}}_{0,t}$ using MaskGIT \cite{maskgit}. The proposed self-guidance, which leverages semantic smoothing, improves the quality of generated samples by enhancing fine-scale details.
  • Figure 3: (a) Fine-tuning the feature selection module ${\mathcal{H}}_\phi$ (TOAST \cite{shi2023toast}). With the auxiliary objective in \ref{eq:loss_corr}, ${\mathcal{H}}_\phi$ implicitly learns to smooth the erroneous input ${\bm{z}}_t$ to address semantic outliers (\ref{sec:method:body}). (b) During the sampling steps, self-guidance can be efficiently implemented by leveraging the feature map from the generative process. ${\mathcal{H}}_\phi$ performs semantic smoothing on the input ${\bm{x}}_t$, guiding the sampling process toward enhancing fine-scale details in the generated sample.
  • Figure 4: IS vs. FID curves of various sampling methods for MGMs on ImageNet 256$\times$256 and 512$\times$512. A curve positioned toward the bottom right indicates a better trade-off between sample quality and diversity. We plot each curve by varying the sampling temperature ($\tau$); the curves for MaskGIT \cite{maskgit} and Token-Critic \cite{tokencritic} are taken from Token-Critic \cite{tokencritic}.
  • Figure 5: Images sampled for ImageNet 256$\times$256 class-conditional generation using selected classes (105: Koala, 661: Model T, and 933: Cheeseburger). Left: LDM \cite{ldm} + CFG (s=1.5, NFE=250$\times$2). Middle: MaskGIT (NFE=18). Right: Ours (s=1.0, NFE=18$\times$2).
  • ...and 3 more figures