Plug-and-Play Context Feature Reuse for Efficient Masked Generation

Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang

TL;DR

MGMs aim to model the joint distribution over image tokens via masking but face high latency when decoding many tokens per step. This paper proposes ReCAP, a plug-and-play mechanism that interleaves Full-FE and Local-FE steps and reuses cached attention KV to achieve context feature reuse, reducing per-step cost from $O(N^2)$ to $O(\hat{n}_t N)$ while preserving dependencies. Evaluations on MaskGIT, MAR, and MAGE on ImageNet256 show up to 2.4x faster inference with minimal fidelity loss across discrete and continuous MGMs, demonstrating consistent efficiency–fidelity gains. ReCAP is architecture-agnostic, training-free, and broadly applicable to discrete and continuous token MGMs, suggesting practical impact for deploying high-fidelity image generation at lower latency.
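To make the $O(N^2)$ versus $O(\hat{n}_t N)$ comparison concrete, the following is a rough counting sketch, not a latency model. It assumes $N$ is the sequence length and $\hat{n}_t$ is the number of target tokens recomputed in a Local-FE step, and it counts only query-key pairs scored by attention. The function name and all schedule numbers (256 tokens, 64 steps, 4 targets per local step) are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def attention_pair_count(num_tokens, num_groups, local_steps_per_group, targets_per_local_step):
    """Count query-key pairs scored by attention over one grouped schedule.

    Each group runs one full step (num_tokens * num_tokens pairs) followed by
    local steps that score only targets_per_local_step queries against all keys.
    """
    full_pairs = num_groups * num_tokens * num_tokens
    local_pairs = num_groups * local_steps_per_group * targets_per_local_step * num_tokens
    return full_pairs + local_pairs


N = 256  # hypothetical sequence length

# 64 decoding steps, all of them full evaluations.
baseline = attention_pair_count(N, num_groups=64, local_steps_per_group=0,
                                targets_per_local_step=0)

# Still 64 decoding steps: 16 groups of (1 full + 3 local) steps.
interleaved = attention_pair_count(N, num_groups=16, local_steps_per_group=3,
                                   targets_per_local_step=4)

print(baseline, interleaved, baseline / interleaved)
```

Under these assumed numbers the interleaved schedule scores roughly a quarter of the pairs for the same number of decoding steps. Real speedups also depend on MLP cost, memory traffic, and per-model details, so this ratio should not be read as a prediction of the reported 2.4x figure.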

Abstract

Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is by decoding more tokens in each step, thereby reducing the total number of steps. However, when many tokens are decoded simultaneously, the model can only estimate the univariate marginal distributions independently, failing to capture the dependency among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps via reusing feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), including both discrete and continuous token spaces and covering diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.
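The idea of a lightweight step that reuses cached context features can be sketched with a single attention head in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses one layer, one head, no normalization or MLP, and the names `full_step` and `local_step` are hypothetical. The full step computes attention over all tokens and returns K and V for caching; the local step recomputes Q, K, and V only for the target tokens and reuses the cached K and V for everything else.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_step(x, Wq, Wk, Wv):
    """Full evaluation: attention over all N tokens; K and V are returned for caching."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return out, k, v


def local_step(x_target, target_idx, k_cache, v_cache, Wq, Wk, Wv):
    """Lightweight evaluation: recompute Q, K, V for target tokens only.

    Cached K and V rows of all other tokens are reused as-is, so the score
    matrix has shape (num_targets, N) instead of (N, N).
    """
    q = x_target @ Wq
    k, v = k_cache.copy(), v_cache.copy()
    k[target_idx] = x_target @ Wk  # refresh only the target rows
    v[target_idx] = x_target @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


rng = np.random.default_rng(0)
N, d = 16, 8  # toy sizes, chosen for illustration
x = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

full_out, k_cache, v_cache = full_step(x, Wq, Wk, Wv)
target_idx = np.array([2, 5, 11])
local_out = local_step(x[target_idx], target_idx, k_cache, v_cache, Wq, Wk, Wv)
```

In this toy setup the inputs are unchanged between the two calls, so `local_out` matches the corresponding rows of `full_out`. In actual decoding the cached features of non-target tokens would be slightly stale, which is the approximation the method relies on.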

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: FID vs. inference time on ImageNet256 class-conditional generation. As the number of decoding steps increases, MAR li2024autoregressive achieves better FID but incurs high inference cost. ReCAP significantly accelerates MAR by replacing part of full-eval steps with low-cost steps, achieving 2.4$\times$ faster inference for MAR-Huge with minimal quality loss (FID 1.56 vs. 1.57). For fair comparison, we adopt the version of REPA without interval guidance kynkaanniemi2024applying, as also reported in the original paper yu2024representation. $^*$ denotes the use of KV caching shazeer2019fast for fast inference.
  • Figure 2: Context feature stability during iterative decoding. We measure similarity between context representations before and after token updates, using a pretrained MaskGIT on 50K ImageNet256 samples. At each decoding stage, we extract the input embeddings to the attention module for the $K$ already-decoded tokens. These are average-pooled within each layer to obtain an aggregated context vector. Cosine similarity is computed between these vectors before and after updates and averaged across layers; shaded regions indicate layer-wise standard deviation. Greater stability at larger $K$ supports reusing cached features in later decoding stages.
  • Figure 3: Grouped Decoding Pipeline with Cached Attention. Inference is organized into $T$ groups, each performing one Full-FE and several Local-FE steps. In the Full-FE, full attention is computed over the entire sequence, and KVs for the static context tokens and other masked tokens are cached. In each Local-FE, only the QKVs of the target tokens are recomputed, while the cached KVs are reused to form the full attention context. The context feature reuse mechanism effectively reduces computation cost in local evaluation steps.
  • Figure 4: FID vs. inference time for MaskGIT variants and comparative models. $^*$: taken from the MaskGIT paper chang2022maskgit. $^\dagger$: with CFG ho2022classifier. U-ViT bao2022all adopts 7 sampling steps in this figure.
  • Figure 5: Speed/Performance trade-off for MAR variants and SoTA baselines. ReCAP consistently improves inference efficiency of MAR-Large and MAR-Huge. VARs tian2024visual are SoTA AR models performing next-scale prediction; $^*$ denotes the use of KV caching shazeer2019fast. REPA yu2024representation is a SoTA flow-matching model relying on vision foundation models oquab2023dinov2; $^\ddagger$ denotes the use of advanced guidance interval sampling kynkaanniemi2024applying. DPM solvers lu2022dpm, lu2022dpm+ augment DiT Peebles2023 and U-ViT bao2022all.
  • ...and 2 more figures
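
The stability measurement described in the Figure 2 caption (average-pool the context tokens' attention inputs within each layer, take the cosine similarity before and after an update, then report the mean and standard deviation across layers) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `context_stability`, the array layout `(num_layers, K, dim)`, and the random toy data are all assumptions.

```python
import numpy as np


def context_stability(before, after):
    """Cosine similarity of pooled context vectors, summarized across layers.

    before, after: arrays of shape (num_layers, K, dim) holding the attention
    inputs of the K already-decoded tokens before and after a token update.
    Returns (mean, std) of the per-layer cosine similarities.
    """
    pooled_before = before.mean(axis=1)  # (num_layers, dim): average over the K tokens
    pooled_after = after.mean(axis=1)
    dots = (pooled_before * pooled_after).sum(axis=-1)
    norms = np.linalg.norm(pooled_before, axis=-1) * np.linalg.norm(pooled_after, axis=-1)
    per_layer = dots / norms
    return per_layer.mean(), per_layer.std()


rng = np.random.default_rng(0)
before = rng.normal(size=(4, 10, 8))                    # toy: 4 layers, K=10, dim=8
after = before + 0.01 * rng.normal(size=before.shape)   # small synthetic perturbation

mean_sim, std_sim = context_stability(before, after)
same_mean, same_std = context_stability(before, before)
print(mean_sim, std_sim)
```

A mean similarity close to 1 with a small standard deviation would indicate that the context features barely change after an update, which is the condition under which reusing cached features is a reasonable approximation.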