Plug-and-Play Context Feature Reuse for Efficient Masked Generation
Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang
TL;DR
Masked generative models (MGMs) model the joint distribution over image tokens via masking, but they face a trade-off: many decoding steps mean high latency, while decoding many tokens per step loses the dependencies among them. This paper proposes ReCAP, a plug-and-play mechanism that interleaves standard full evaluations (Full-FE) with lightweight Local-FE steps that reuse cached attention key-value (KV) features of context tokens, reducing per-step cost from $O(N^2)$ to $O(\hat{n}_t N)$ while preserving dependencies. On ImageNet256 with MaskGIT, MAR, and MAGE, ReCAP achieves up to 2.4x faster inference with minimal fidelity loss and consistently improves the efficiency-fidelity trade-off. ReCAP is architecture-agnostic, training-free, and applicable to both discrete and continuous token MGMs, suggesting practical impact for deploying high-fidelity image generation at lower latency.
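The cost reduction from $O(N^2)$ to $O(\hat{n}_t N)$ can be illustrated with a toy, single-layer sketch. It is not the authors' implementation: the function names (`attend`, `full_step`, `local_step`), the identity query/key/value projections, and the reading of $N$ as the total token count and $\hat{n}_t$ as the number of tokens recomputed in a lightweight step are all assumptions made for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention.

    Returns the outputs and the number of query-key scores computed,
    which serves as a proxy for attention cost.
    """
    d = len(keys[0])
    outputs, num_scores = [], 0
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        num_scores += len(scores)
        w = softmax(scores)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, num_scores

def full_step(x):
    """Full-evaluation-style step: all N tokens query all N tokens.

    Toy assumption: identity projections, so K = V = x. K/V are cached.
    """
    cache = {"k": [row[:] for row in x], "v": [row[:] for row in x]}
    out, cost = attend(x, cache["k"], cache["v"])
    return out, cost, cache

def local_step(x, active, cache):
    """Lightweight step: refresh K/V only at `active` positions, then let
    only those positions query the cached keys and values."""
    for i in active:
        cache["k"][i] = x[i][:]
        cache["v"][i] = x[i][:]
    out, cost = attend([x[i] for i in active], cache["k"], cache["v"])
    return out, cost

# N = 6 tokens with 2-dimensional features (arbitrary toy values).
x = [[0.1 * (i + 1), 0.05 * (i - 2)] for i in range(6)]
_, full_cost, cache = full_step(x)

# Pretend positions 1 and 4 were just decoded and received new features.
x[1] = [1.0, -0.5]
x[4] = [-0.3, 0.8]
local_out, local_cost = local_step(x, [1, 4], cache)

print(full_cost, local_cost)  # 36 12
```

In this single-layer toy the lightweight step reproduces the full result for the recomputed positions exactly. In a multi-layer transformer, cached context features in deeper layers would not reflect the newly decoded tokens, so reuse is presumably an approximation there.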
Abstract
Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is to decode more tokens in each step, thereby reducing the total number of steps. Yet when many tokens are decoded simultaneously, the model can only estimate their univariate marginal distributions independently and fails to capture the dependencies among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps that reuse feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), which span both discrete and continuous token spaces and cover diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.
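The fidelity problem with decoding many tokens at once can be seen in a two-token toy example. The joint distribution, token names, and helper functions below are hypothetical and chosen only to make the effect visible: the two tokens always agree, so sampling each one from its own marginal produces pairs that the joint distribution never generates.

```python
import random
from collections import Counter

# Hypothetical joint distribution over two tokens that must always agree.
joint = {("A", "A"): 0.5, ("B", "B"): 0.5}

def marginal(joint, pos):
    """Univariate marginal of the token at position `pos`."""
    m = Counter()
    for pair, p in joint.items():
        m[pair[pos]] += p
    return dict(m)

def sample(dist, rng):
    items, weights = zip(*dist.items())
    return rng.choices(items, weights=weights, k=1)[0]

def parallel_decode(joint, rng):
    """One step: both tokens drawn independently from their marginals."""
    return (sample(marginal(joint, 0), rng), sample(marginal(joint, 1), rng))

def sequential_decode(joint, rng):
    """Two steps: the second token is conditioned on the first."""
    first = sample(marginal(joint, 0), rng)
    cond = {pair[1]: p for pair, p in joint.items() if pair[0] == first}
    return (first, sample(cond, rng))

rng = random.Random(0)
par = [parallel_decode(joint, rng) for _ in range(1000)]
seq = [sequential_decode(joint, rng) for _ in range(1000)]

# Count samples that have zero probability under the joint distribution.
invalid_par = sum(pair not in joint for pair in par)
invalid_seq = sum(pair not in joint for pair in seq)
print(invalid_par, invalid_seq)
```

Sequential decoding never yields a mismatched pair, whereas parallel decoding does so roughly half the time in this toy setup. This is the dependency loss that motivates keeping fine-grained iterative steps while making some of them cheaper.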
