LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching

Feihong Yan, Qingyan Wei, Jiayi Tang, Jiajun Li, Yulin Wang, Xuming Hu, Huiqi Li, Linfeng Zhang

TL;DR

This work addresses the efficiency bottleneck of Masked Autoregressive (MAR) models, whose bidirectional attention prevents effective KV caching. It introduces LazyMAR, a training-free, plug-and-play caching framework that exploits two redundancies, Token Redundancy and Condition Redundancy, through a Token Cache and a Condition Cache, with a periodic cache-refresh strategy to bound error accumulation. LazyMAR computes all tokens and both the conditional and unconditional paths only in the initial steps and at each cache refresh, and otherwise selectively reuses cached features or residuals. It achieves about a 2.83× acceleration with minimal degradation in image quality on ImageNet 256×256 across MAR variants. Ablations show that both caches contribute to the speedup and that a similarity-based token selection strategy yields the best results, making high-speed MAR generation practical without extra training.

Abstract

Masked Autoregressive (MAR) models have emerged as a promising approach to image generation and are expected to surpass traditional autoregressive models in computational efficiency by decoding tokens in parallel. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching, creating unexpected computational bottlenecks that undermine this expected efficiency. To address this problem, this paper studies caching mechanisms for MAR that exploit two types of redundancy. Token Redundancy means that a large portion of tokens have very similar representations in adjacent decoding steps, which allows us to cache them in earlier steps and reuse them in later steps. Condition Redundancy means that the difference between the conditional and unconditional outputs in classifier-free guidance takes very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms, one for each redundancy. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves a 2.83× acceleration with almost no drop in generation quality. Our code will be released at https://github.com/feihongyan1/LazyMAR.
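
To make the Condition Redundancy idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the common classifier-free guidance form `uncond + scale * (cond - uncond)`; the class name `ConditionCache`, its methods, and the toy vectors are hypothetical and chosen only for illustration. On a refresh step both pathways run and their residual is stored; on a cached step only the conditional pathway runs and the unconditional output is approximated from the stored residual.

```python
from typing import List, Optional

Vector = List[float]


class ConditionCache:
    """Hypothetical helper: stores the residual (cond - uncond) between refreshes."""

    def __init__(self) -> None:
        self.residual: Optional[Vector] = None

    def refresh(self, cond: Vector, uncond: Vector) -> None:
        # Refresh step: both pathways were computed, so store their difference.
        self.residual = [c - u for c, u in zip(cond, uncond)]

    def approx_uncond(self, cond: Vector) -> Vector:
        # Cached step: only the conditional pathway ran; reuse the stored residual.
        if self.residual is None:
            raise RuntimeError("refresh() must be called before approx_uncond()")
        return [c - r for c, r in zip(cond, self.residual)]


def guided(cond: Vector, uncond: Vector, scale: float) -> Vector:
    # Assumed classifier-free guidance combination (not taken from the source).
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cache = ConditionCache()

# Step t (refresh): run both pathways and store the residual.
cond_t, uncond_t = [1.0, 2.0, 3.0], [0.5, 1.0, 2.5]
cache.refresh(cond_t, uncond_t)

# Step t+1 (cached): run only the conditional pathway.
cond_next = [1.2, 2.1, 3.3]
uncond_next = cache.approx_uncond(cond_next)
out = guided(cond_next, uncond_next, scale=3.0)
```

The approximation is only as good as the assumption that the residual changes slowly between adjacent steps, which is why the cache is refreshed periodically rather than computed once.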

Paper Structure

This paper contains 30 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Token Redundancy in MARs: In each decoding step of MAR, tokens can be divided into four types: the tokens to be decoded in this step ($t$), the tokens just decoded in the last step ($t-1$), the tokens that were decoded before the last step, and the tokens that have not yet been decoded. The different token types exhibit different similarities between adjacent decoding steps.
  • Figure 2: Conditional Redundancy in MARs: Both the conditional and unconditional outputs exhibit a significant distance between adjacent decoding steps, while their residual (cond. - uncond.) exhibits only a minor distance, which allows the residual to be cached and then reused.
  • Figure 3: The pipeline of LazyMAR: We compute all the tokens and both the conditional and unconditional pathways in the first several steps, which are important for generating the basic structure of the image. Then, we apply the token cache and condition cache periodically until the final step. (a) Token Cache: In the first step of each period, we compute all the tokens and store their features to initialize the two caches. In the following steps, we compute all tokens only in the first three layers and compare the resulting features with their values in previous steps. In the remaining layers, we compute only the tokens with large differences and skip the other tokens by reusing their features from the token cache. Meanwhile, we update the token cache with the features of the tokens that have been computed. (b) Condition Cache: We compute both the conditional and unconditional pathways and store their residual at the first step of each period. In the following steps, we compute only the conditional pathway and approximate the output of the unconditional pathway by reusing the residual in the condition cache.
  • Figure 4: Qualitative comparison between LazyMAR and the acceleration achieved through step reduction. Experimental results show that our method maintains good consistency with the original images in both structure and details.
  • Figure 5: (a) The relationship between the difference of tokens in adjacent steps at the early layer and the final layer. A strong positive correlation is observed, with a Pearson correlation coefficient of $\rho = 0.923$. (b)-(e) show the difference between the cached features and the corresponding current features at different positions in adjacent steps.
  • ...and 1 more figures
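
The token-cache schedule described in the Figure 3 caption can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the released LazyMAR code: cosine similarity stands in for the similarity measure, and the names `select_tokens_to_compute`, `is_refresh_step`, `merge_with_cache`, and the parameters `keep_ratio`, `warmup_steps`, and `period` are hypothetical. Early-layer features of every token are compared with their cached values, the least similar tokens are recomputed in the remaining layers, and the rest are served from the cache.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 1.0


def select_tokens_to_compute(
    early_now: List[Vector], early_cached: List[Vector], keep_ratio: float
) -> List[int]:
    """Return indices of the tokens whose early-layer features changed the most."""
    sims = [cosine_similarity(n, c) for n, c in zip(early_now, early_cached)]
    budget = max(1, math.ceil(keep_ratio * len(sims)))
    # Lowest similarity means largest change, so those tokens are recomputed first.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(ranked[:budget])


def is_refresh_step(step: int, warmup_steps: int, period: int) -> bool:
    """Full computation during warm-up and at the first step of each period."""
    return step < warmup_steps or (step - warmup_steps) % period == 0


def merge_with_cache(cache: List[Vector], computed: Dict[int, Vector]) -> List[Vector]:
    """Update the cache with recomputed tokens; all other tokens are reused as-is."""
    for idx, feat in computed.items():
        cache[idx] = feat
    return cache


# Toy example: four tokens with 2-dimensional early-layer features.
early_cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
early_now = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.9, 0.1]]
selected = select_tokens_to_compute(early_now, early_cached, keep_ratio=0.5)

# Pretend the remaining layers produced these features for the selected tokens.
final_cache = [[0.0], [0.0], [0.0], [0.0]]
final = merge_with_cache(final_cache, {i: [float(i)] for i in selected})
schedule = [is_refresh_step(s, warmup_steps=2, period=3) for s in range(6)]
```

In this toy run, tokens 1 and 3 differ most from their cached early-layer features, so only they would pass through the remaining layers. Ranking by early-layer change relies on the correlation between early-layer and final-layer differences reported in Figure 5(a).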