Halton Scheduler For Masked Generative Image Transformer

Victor Besnier, Mickael Chen, David Hurych, Eduardo Valle, Matthieu Cord

TL;DR

The paper targets scheduling in MaskGIT to improve information gain during token unmasking. It provides a mutual-information analysis showing that the traditional Confidence scheduler creates token clustering and limited long-range gains, while a Halton low-discrepancy sequence spreads token selection to maximize information per step. The Halton scheduler is a drop-in replacement requiring no retraining or noise injection and yields lower FID and higher image diversity on ImageNet and COCO benchmarks. Empirically, Halton achieves about a 2.2–2.7 point FID improvement over Confidence across class-to-image and text-to-image tasks, with enhanced background detail and texture. Overall, the approach narrows the gap between masked-image modeling and diffusion-style methods while preserving fast inference.

Abstract

Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT's token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects token positions according to a quasi-random, low-discrepancy Halton sequence. Intuitively, this method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it reduces non-recoverable sampling errors, leading to simpler hyper-parameter tuning and better-quality images. Our scheduler does not require retraining or noise injection and may serve as a simple drop-in replacement for the original sampling strategy. Evaluation of both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively by reducing the FID and qualitatively by generating more diverse and more detailed images. Our code is at https://github.com/valeoai/Halton-MaskGIT.
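To make the idea of a Halton-ordered unmasking concrete, here is a minimal Python sketch. It is not taken from the authors' repository: the function names `radical_inverse` and `halton_token_order`, the choice of bases 2 and 3, and the mapping of points to grid cells by truncation are all assumptions made for illustration.

```python
def radical_inverse(index: int, base: int) -> float:
    """Van der Corput radical inverse of `index` in `base`, a value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_token_order(height: int, width: int) -> list[tuple[int, int]]:
    """Order every (row, col) cell of a token grid along a 2D Halton sequence.

    Assumptions (hypothetical, for illustration): base 2 drives the rows,
    base 3 drives the columns, and repeated cells are skipped so that each
    token position appears exactly once.
    """
    order: list[tuple[int, int]] = []
    seen: set[tuple[int, int]] = set()
    index = 1
    while len(order) < height * width:
        # Scale the point in the unit square to integer grid coordinates.
        row = min(int(radical_inverse(index, 2) * height), height - 1)
        col = min(int(radical_inverse(index, 3) * width), width - 1)
        if (row, col) not in seen:
            seen.add((row, col))
            order.append((row, col))
        index += 1
    return order


if __name__ == "__main__":
    # Print the first few positions of a 16x16 token grid.
    print(halton_token_order(16, 16)[:8])
```

A scheduler built on such an ordering would unmask consecutive slices of the list at each step, so that tokens revealed together sit far apart on the grid. How many tokens go into each slice is a separate design choice that this sketch leaves open.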

Paper Structure

This paper contains 22 sections, 8 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Text-to-Image samples comparison. The Halton scheduler allows sampling more diverse and more detailed images than the traditional confidence scheduler, which is visible both in the foreground elements and the background.
  • Figure 2: (Left) MaskGIT image generation. From masked tokens and the class token, MaskGIT samples the full image step-by-step, using a scheduler to pick which tokens to unmask. After $S$ steps, all tokens are sampled, and a deterministic decoder transforms the entire sequence into an image. (Right) The Halton scheduler employs the quasi-random Halton sequence to strategically distribute tokens across the image, reducing the correlation between tokens sampled in the same step and maximizing the information they provide.
  • Figure 3: Evolving predictions of different schedulers on a reconstruction comparison. (left) MaskGIT predicted entropy maps, with darker colors for lower entropies and known tokens in dark red. Images are reconstructed (right) by revealing the ground-truth tokens of a reference image at the locations selected by the scheduler. The Confidence scheduler picks the most certain tokens first and, thus, tends to cluster around already unmasked areas. Adding noise alleviates but does not solve the problem. The Halton scheduler provides a more uniform image coverage at each step of the sampling process.
  • Figure 4: Mutual Information scheduler analysis. (x-axis) The sum of the entropy of the unmasked tokens in $\mathcal{X}_s$ corresponds to term \ref{eq:mi_ent}.a. (y-axis) The sum of the distances of each unmasked token in $\mathcal{X}_s$ to their closest neighbor is a reversed proxy for the term \ref{eq:mi_ent}.b. Each scheduler draws a curve in this plot as the number of sampling steps progresses. The Halton scheduler stays closer to the top-left corner, the desirable part of the plot.
  • Figure 5: Ablation on the number of steps. The Halton scheduler demonstrates scalability as the number of steps increases, whereas the Confidence scheduler's performance deteriorates. The Halton scheduler consistently outperforms the Confidence scheduler when the number of steps exceeds 12.
  • ...and 8 more figures
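
The y-axis quantity described for Figure 4, the sum of each unmasked token's distance to its closest neighbor, can be sketched as follows. The function name `nearest_neighbor_spread` is hypothetical, and the use of Euclidean distance on grid coordinates is an assumption; the paper's exact metric may differ.

```python
import math


def nearest_neighbor_spread(positions: list[tuple[int, int]]) -> float:
    """Sum, over all positions, of the distance to the closest other position.

    Assumption: Euclidean distance on (row, col) grid coordinates.
    Larger values mean the selected tokens are more spread out.
    """
    if len(positions) < 2:
        return 0.0  # No neighbor to measure against.
    total = 0.0
    for i, p in enumerate(positions):
        total += min(
            math.dist(p, q) for j, q in enumerate(positions) if j != i
        )
    return total


if __name__ == "__main__":
    clustered = [(0, 0), (0, 1), (1, 0), (1, 1)]  # tight 2x2 block
    spread = [(0, 0), (0, 3), (3, 0), (3, 3)]     # corners of a 4x4 grid
    print(nearest_neighbor_spread(clustered))
    print(nearest_neighbor_spread(spread))
```

In this toy comparison the corner layout scores higher than the tight block, which matches the caption's reading that a larger distance sum indicates less redundancy among tokens unmasked in the same step.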