Partition Generative Modeling: Masked Modeling Without Masks
Justin Deschenaux, Lan Tran, Caglar Gulcehre
TL;DR
Partition Generative Modeling (PGM) replaces the conventional masking in Masked Generative Models with a two-group partition and a GroupSwap mechanism, enabling fast, parallel sampling by restricting cross-group information flow. The Partition Transformer architecture supports partition-wise self-attention and cross-partition conditioning without full self-attention, allowing inference to focus on the unmasked (clean) tokens while still leveraging full-token supervision during training. Empirically, PGMs achieve 5–5.5x throughput gains on language tasks and up to 7.5x gains on ImageNet compared with strong MGM baselines, with only modest drops in sample quality and notable improvements when distillation is used. The approach remains compatible with distillation techniques (SDTT) and CFG, offering a scalable, flexible alternative to MGMs for high-speed generation and potential multimodal extensions.
Abstract
Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive (AR) models through parallel decoding. However, MGMs typically operate on fixed-length inputs, which can be inefficient: early in sampling, most tokens are masked and carry no information, leading to wasted computation. In contrast, AR models process only previously generated tokens, making early iterations faster. In this work, we introduce the Partition Generative Model (PGM), a novel approach that combines the strengths of AR models and MGMs. Rather than masking, PGM partitions tokens into two groups and employs sparse attention to block information flow between them. Since there is no information flow between partitions, the model can process only the previously generated tokens during sampling, while retaining the ability to generate tokens in parallel and in any order. On OpenWebText, PGMs offer at least $5\times$ improvements in sampling latency and throughput, while producing samples with superior Generative Perplexity, compared to Masked Diffusion Language Models. On ImageNet, PGMs achieve a $7.5\times$ higher throughput than MaskGIT, with only a slight increase in FID (5.54 vs. 5.35). With twice as many sampling steps, the FID reduces to 4.56 while being $3.9\times$ faster than MaskGIT. Finally, PGMs integrate seamlessly with MGM distillation, providing further inference speedups.
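To make the partitioning idea concrete, the following is a minimal sketch, not the Partition Transformer itself. It assumes that blocking information flow amounts to a boolean attention mask in which a position may attend only to positions in its own group. The helper names `random_partition` and `partition_mask` are hypothetical, and the cross-partition conditioning mentioned in the TL;DR is deliberately left out.

```python
import random


def random_partition(n, rng):
    """Assign each of n token positions to group 0 or group 1 (illustrative choice)."""
    return [rng.randint(0, 1) for _ in range(n)]


def partition_mask(groups):
    """Build an n x n boolean mask: entry [i][j] is True when position i
    may attend to position j, i.e. when both positions share a group."""
    n = len(groups)
    return [[groups[i] == groups[j] for j in range(n)] for i in range(n)]


rng = random.Random(0)
groups = random_partition(8, rng)
mask = partition_mask(groups)

# Treat group 0 as the already-generated (clean) tokens, purely for illustration.
clean = [i for i, g in enumerate(groups) if g == 0]

# Restricting the mask to the clean positions leaves a dense block: those
# tokens never read from the other group, so they can be processed alone.
clean_block = [[mask[i][j] for j in clean] for i in clean]

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, the rows belonging to the clean group contain no allowed entries outside that group, which is why a sampler could drop the other positions from the forward pass instead of feeding mask tokens through the network.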
