Masked Generative Nested Transformers with Decode Time Scaling

Sahil Goyal, Debapriya Tula, Gagan Jain, Pradeep Shenoy, Prateek Jain, Sujoy Paul

TL;DR

MaGNeTS tackles inference efficiency in visual generation by allocating compute dynamically: a decode-time model schedule progressively scales model size across decoding iterations, combined with KV caching in parallel decoding. It builds on MaskGIT by introducing nested, parameter-sharing transformers (following MatFormer) and online distillation, so that smaller sub-models process more tokens while larger sub-models refine details. This achieves roughly $3\times$ lower GFLOPs with competitive image and video quality on ImageNet 256×256, UCF101, and Kinetics600. The approach targets class-conditional image synthesis, video generation, and frame prediction, and may also apply across tokenizers and sampling schemes.

Abstract

Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods share a fundamental bottleneck: the computational cost of inference. Most of these algorithms make multiple passes over a transformer model to generate tokens or denoise inputs, yet the model size is kept constant across all iterations, which makes generation computationally expensive. In this work, we address this issue through two key ideas: (a) not all parts of the generation process need equal compute, so we design a decode-time model scaling schedule to use compute effectively, and (b) some of the computation can be cached and reused. Combining these two ideas leads to smaller models processing more tokens while larger models process fewer tokens. These different-sized models do not increase the parameter count, as they share parameters. We experiment rigorously on ImageNet $256\times256$, UCF101, and Kinetics600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost $3\times$ less compute than the baseline, our model obtains competitive performance.
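
The two ideas in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the authors' implementation: the function names, the even split of iterations across model sizes, the cosine masking schedule (MaskGIT-style), and the cost model (per-token cost proportional to $1/p$ for a nested model with downscaling factor $p$, with attention over cached tokens and cache refreshes ignored) are all assumptions made for illustration.

```python
import math


def model_schedule(num_iters, factors=(8, 4, 2, 1)):
    """Assign a downscaling factor p to each decoding iteration.

    Hypothetical helper: iterations are split evenly across the nested
    models, smallest model (largest p) first; the last stage absorbs
    any remainder.
    """
    per_stage = num_iters // len(factors)
    schedule = []
    for i, p in enumerate(factors):
        if i < len(factors) - 1:
            n = per_stage
        else:
            n = num_iters - per_stage * (len(factors) - 1)
        schedule.extend([p] * n)
    return schedule


def masked_tokens(num_tokens, num_iters):
    """Tokens still masked at the start of each iteration (assumed cosine schedule)."""
    return [
        math.ceil(num_tokens * math.cos(math.pi / 2 * t / num_iters))
        for t in range(num_iters)
    ]


def relative_cost(num_tokens=256, num_iters=12, factors=(8, 4, 2, 1)):
    """Toy compute ratio: scheduled nested models with caching vs. full model.

    Assumption: with KV caching only the masked tokens are processed, and a
    nested model with factor p costs 1/p of the full model per token.
    """
    schedule = model_schedule(num_iters, factors)
    masked = masked_tokens(num_tokens, num_iters)
    baseline = num_tokens * num_iters  # full model on all tokens, every iteration
    scaled = sum(m / p for m, p in zip(masked, schedule))
    return scaled / baseline


if __name__ == "__main__":
    print(model_schedule(12))
    print(f"toy relative cost: {relative_cost():.3f}")
```

Under these assumptions the early iterations, where most tokens are still masked, run on the cheapest sub-model, while the full model only sees the few tokens left at the end. The printed ratio comes from this toy cost model only and should not be read as the paper's measured GFLOPs.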

Paper Structure

This paper contains 17 sections, 6 equations, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: Class-conditional image generation on ImageNet $256\times256$, comparing MaskGIT++ and MaGNeTS (size: L).
  • Figure 1: Class-conditional Image Generation on ImageNet $256\times256$. “# steps” refers to the number of neural network runs. $^{\square}$ denotes values taken from prior publications. $^{*}$ indicates usage of extra training data. $g$ denotes use of classifier-free guidance (Ho & Salimans, 2022) for all steps. $g_{x}$ represents use of guidance only for the final $x$ steps.
  • Figure 2: MaGNeTS Decoding. We start from the smallest nested model with an empty cache and gradually move to bigger models over the decoding iterations. We run a model of a given size for a few iterations before moving on to the next model size. Because we cache the key-value pairs of the unmasked tokens, the number of entries in the KV cache grows over time. We also refresh the cache when we switch models, so its dimension increases over decoding iterations as well.
  • Figure 3: Unmasked Token Density visualization in each decoding iteration, averaged over 50k generated samples on ImageNet. Yellow represents higher density. Each pixel represents a token from the $16 \times 16$ latent token space. (See the appendix for category-wise token density.)
  • Figure 4: Nested Models at different decoding iterations. Different values of the downscaling factor $p$ correspond to the nested models. The diameter of the blobs indicates the number of iterations.
  • ...and 8 more figures
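
The cache behaviour described in the Figure 2 caption can be traced with a minimal simulation. This is a sketch under stated assumptions, not the paper's code: `KVCache`, `simulate_cache`, the example schedule, the cosine masking schedule, and the rule that a nested model with downscaling factor $p$ has cache width `d_model // p` are all hypothetical choices for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class KVCache:
    num_tokens: int = 0  # unmasked tokens whose keys/values are cached
    width: int = 0       # hidden width of the model that produced them
    refreshes: int = 0   # how many times the cache was recomputed


def simulate_cache(num_tokens=256, d_model=1024, schedule=(4, 4, 2, 2, 1, 1)):
    """Trace cache size and width over decoding iterations.

    `schedule` lists the downscaling factor p used at each iteration,
    smallest model (largest p) first.
    """
    cache = KVCache()
    history = []
    num_iters = len(schedule)
    prev_p = None
    for t, p in enumerate(schedule):
        # Assumed cosine schedule: tokens still masked at the start of step t.
        masked = math.ceil(num_tokens * math.cos(math.pi / 2 * t / num_iters))
        if prev_p is not None and p != prev_p:
            # Model switch: cached keys/values are recomputed with the new model.
            cache.refreshes += 1
        cache.width = d_model // p              # assumed width of the nested model
        cache.num_tokens = num_tokens - masked  # every unmasked token is cached
        history.append((t, p, cache.num_tokens, cache.width))
        prev_p = p
    return history, cache.refreshes


if __name__ == "__main__":
    history, refreshes = simulate_cache()
    for t, p, cached, width in history:
        print(f"iter {t}: p={p} cached_tokens={cached} cache_width={width}")
    print("refreshes:", refreshes)
```

The trace starts from an empty cache, the number of cached tokens grows as more tokens are unmasked, and the cache width steps up each time the schedule switches to a larger nested model.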