Halton Scheduler For Masked Generative Image Transformer
Victor Besnier, Mickael Chen, David Hurych, Eduardo Valle, Matthieu Cord
TL;DR
The paper revisits the token unmasking scheduler in MaskGIT, with the goal of increasing the information gained at each unmasking step. A mutual-information analysis shows that the standard Confidence scheduler tends to unmask spatially clustered tokens, which limits long-range information gain, whereas a Halton low-discrepancy sequence spreads the selected tokens across the image to maximize information per step. The Halton scheduler is a drop-in replacement that requires no retraining or noise injection, and it yields lower FID and higher image diversity on ImageNet and COCO benchmarks. Empirically, Halton achieves about a 2.2–2.7 point FID improvement over Confidence across class-to-image and text-to-image tasks, with enhanced background detail and texture. Overall, the approach narrows the gap between masked-image modeling and diffusion-style methods while preserving fast inference.
Abstract
Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT's token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects each token's position according to a quasi-random, low-discrepancy Halton sequence. Intuitively, this method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it reduces non-recoverable sampling errors, leading to simpler hyper-parameter tuning and better-quality images. Our scheduler does not require retraining or noise injection and may serve as a simple drop-in replacement for the original sampling strategy. Evaluation on both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively, by reducing the FID, and qualitatively, by generating more diverse and more detailed images. Our code is at https://github.com/valeoai/Halton-MaskGIT.
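To make the idea of selecting token positions with a Halton sequence concrete, the following is a minimal sketch, not the code from the authors' repository. It assumes a 2D Halton sequence with bases 2 and 3, maps each point to a cell of the token grid, and skips cells that were already visited; the function names `radical_inverse` and `halton_token_order` and the 16 x 16 grid size are hypothetical choices made for illustration.

```python
def radical_inverse(index: int, base: int) -> float:
    """Van der Corput radical inverse of `index` in `base`, a value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_token_order(height: int, width: int) -> list[tuple[int, int]]:
    """Return every (row, col) cell of a height x width token grid exactly once,
    in the order a 2D Halton sequence first visits it.

    Assumption: bases 2 (rows) and 3 (columns); duplicates are discarded.
    """
    order: list[tuple[int, int]] = []
    seen: set[tuple[int, int]] = set()
    i = 1
    while len(order) < height * width:
        # Scale the point in [0, 1)^2 to grid coordinates; clamp against rounding.
        row = min(int(radical_inverse(i, 2) * height), height - 1)
        col = min(int(radical_inverse(i, 3) * width), width - 1)
        if (row, col) not in seen:
            seen.add((row, col))
            order.append((row, col))
        i += 1
    return order


if __name__ == "__main__":
    order = halton_token_order(16, 16)
    # The earliest positions are scattered over the grid rather than clustered.
    print(order[:8])
```

Under these assumptions, a sampler would unmask tokens by walking this fixed order in chunks, one chunk per step, instead of ranking positions by model confidence. How many tokens are unmasked at each step is a separate choice that this sketch does not cover.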
