LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding

Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, Eunho Yang

TL;DR

This paper studies speeding up visual autoregressive image generation by extending speculative decoding, a technique proven in LLMs, to visual AR models. The key challenge is token selection ambiguity, where next-token probabilities are broadly distributed and draft tokens are frequently rejected. The authors introduce LANTERN, which relaxes the acceptance condition by aggregating probabilities over latent-space neighborhoods (A_{k,δ}) of draft tokens and enforces a total variation distance bound to limit distortion, dramatically increasing draft acceptance and achieving substantial speed-ups (up to ~1.75×–2.26×) with minimal quality loss on models like LlamaGen-XL and Anole. This approach enables practical acceleration of visual AR generation, with tunable trade-offs between speed and image quality, and broadens the applicability of speculative decoding to multi-modal autoregressive generation.

Abstract

Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature: they process tokens one at a time, which slows down generation compared to models like GANs or diffusion-based methods that operate more efficiently. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward pass, its application in visual AR models remains largely unexplored. In this work, we identify a challenge in this setting, which we term token selection ambiguity, wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, we propose a relaxed acceptance condition, referred to as LANTERN, that leverages the interchangeability of tokens in latent space. This relaxation restores the effectiveness of speculative decoding in visual AR models by enabling more flexible use of candidate tokens that would otherwise be prematurely rejected. Furthermore, by incorporating a total variation distance bound, we ensure that these speed gains are achieved without significantly compromising image quality or semantic coherence. Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. Specifically, compared to a naïve application of state-of-the-art speculative decoding, LANTERN increases speed-ups by $1.75\times$ and $1.82\times$, as compared to greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model. The code is publicly available at https://github.com/jadohu/LANTERN.
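
The relaxed acceptance condition described above can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that); it assumes the usual speculative-decoding acceptance probability min(1, q(x)/p(x)) for target distribution q and drafter distribution p, and assumes the neighborhood is grown nearest-first among the k closest codebook entries while the probability mass moved onto the draft token stays below δ. Moving a mass S from neighbors onto one token changes the distribution by a total variation distance of exactly S, which is why the running sum can act as the bound. The function names, the toy codebook, and the choice to exclude the draft token from the count of k are all hypothetical.

```python
import numpy as np

def neighborhood(codebook, token, q, k, delta):
    """Return `token` plus nearby codebook indices, added nearest-first
    while the target mass moved onto `token` stays strictly below `delta`.
    Assumption: k counts neighbours other than `token` itself."""
    dists = np.linalg.norm(codebook - codebook[token], axis=1)
    order = np.argsort(dists, kind="stable")
    order = order[order != token][:k]      # k nearest neighbours of token
    chosen, moved = [token], 0.0
    for idx in order:
        if moved + q[idx] >= delta:        # would reach the TV budget: stop
            break
        moved += q[idx]
        chosen.append(int(idx))
    return chosen, moved

def standard_accept_prob(q, p, token):
    """Usual speculative-decoding acceptance probability min(1, q/p)."""
    return min(1.0, q[token] / p[token])

def relaxed_accept_prob(q, p, token, codebook, k, delta):
    """Acceptance probability with q(token) replaced by the aggregated
    target mass over the neighbourhood (a sketch of the relaxation)."""
    chosen, _ = neighborhood(codebook, token, q, k, delta)
    return min(1.0, q[chosen].sum() / p[token])

# Toy example: three nearly identical latent codes and one distant code.
codebook = np.array([[0.0], [0.1], [0.2], [5.0]])
q = np.array([0.1, 0.1, 0.1, 0.7])   # target model probabilities
p = np.array([0.5, 0.2, 0.2, 0.1])   # drafter probabilities

print(standard_accept_prob(q, p, 0))
print(relaxed_accept_prob(q, p, 0, codebook, k=2, delta=0.25))
```

In this toy setting a larger δ admits more neighbors and raises the acceptance probability, while δ = 0 recovers the standard rule, which mirrors the speed/quality trade-off the abstract describes.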

Paper Structure

This paper contains 42 sections, 6 equations, 13 figures, 12 tables, 2 algorithms.

Figures (13)

  • Figure 1: Images generated by vanilla decoding (top) and lossy speculative decoding with our relaxed acceptance condition (bottom) on the text-conditioned LlamaGen-XL Stage ii model \cite{llama-gen}. The mean accepted length for each image is displayed in white at its bottom-right corner.
  • Figure 2: (a) Mean accepted length of a naïve application of existing speculative decoding methods on a visual AR model and its LLM counterpart. (b) Top-1 and top-3 accuracy of the learned drafter model in predicting the target model's outputs. (c) Average top-1 and top-10 probabilities in next-token prediction.
  • Figure 3: Images generated by the text-conditioned LlamaGen-XL Stage ii model \cite{llama-gen}. The images are generated either by the standard sampling method (top) or by sampling with random replacement within the 100 closest tokens in the latent space (bottom).
  • Figure 4: Qualitative samples generated by the LlamaGen-XL Stage ii model with LANTERN and with standard autoregressive decoding. From top to bottom, the rows show standard autoregressive decoding, LANTERN with $\delta=0.2$, and LANTERN with $\delta=0.4$, where $k$ is fixed at 1000. Images in the same column are generated from the same text prompt. Text prompts for the images are provided in Appendix \ref{sec:qual_prompts}.
  • Figure 5: Trade-off curves show the relationship between performance (FID) and acceleration (mean accepted length). The results with the same $k$ are annotated with the same color, while the same $\delta$ values are marked with identical symbols. In the legend, the values are separated by commas, indicating $k$ and $\delta$, respectively.
  • ...and 8 more figures
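
The random-replacement experiment described in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions: the codebook here is random data rather than a real image tokenizer, closeness is taken to be Euclidean distance between codebook vectors, and a token counts among its own closest entries. The function name `replace_with_neighbours` and its parameters are hypothetical.

```python
import numpy as np

def replace_with_neighbours(tokens, codebook, n_closest, rng):
    """Replace each token id by a uniformly random id drawn from its
    `n_closest` nearest codebook entries (the token itself included)."""
    tokens = np.asarray(tokens)
    a = codebook[tokens]                                  # (T, D) latent codes
    # Squared Euclidean distances via |a|^2 - 2ab + |b|^2, shape (T, V)
    d2 = (a ** 2).sum(1)[:, None] - 2.0 * a @ codebook.T + (codebook ** 2).sum(1)[None, :]
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :n_closest]
    picks = rng.integers(0, n_closest, size=len(tokens))  # one pick per token
    return nearest[np.arange(len(tokens)), picks]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # toy stand-in for a VQ codebook
tokens = [0, 3, 7]

print(replace_with_neighbours(tokens, codebook, n_closest=4, rng=rng))
```

Decoding the replaced sequence with the tokenizer's decoder and comparing it with the original image is what the caption's top and bottom rows contrast; that decoding step is outside this sketch.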