MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration

Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park

TL;DR

This work tackles the slow inference of autoregressive (AR) visual generation by augmenting Speculative Jacobi Decoding (SJD) with coupling-based draft sampling. The authors show that instability in independently sampled draft tokens limits SJD’s speedups, and introduce MC-SJD, which uses maximal (or Gumbel) coupling to raise the probability that consecutive iterations draw identical draft tokens while preserving the exact target distribution. The result is a training-free, lossless acceleration of up to ~4× for image generation and ~13× for video generation, with no degradation in quality metrics. The approach combines a fixed-point perspective with information-theoretic coupling to stabilize iteration trajectories, making AR visual generation more practical.

Abstract

While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
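To make the idea of "sampling identical draft tokens across consecutive iterations" concrete, the following is a minimal sketch of the textbook maximal-coupling construction for two categorical distributions. It is an illustration under assumptions, not the authors' implementation: the names `maximal_coupling_step`, `sample_categorical`, `q_prev`, and `q_new` are hypothetical, and the paper's actual sampler may differ in detail. Given a token drawn from the previous iteration's draft distribution, the step returns a token that follows the new distribution exactly while matching the old token with probability $1-\mathcal{D}_{TV}$.

```python
import random

def total_variation(p, q):
    """Total variation distance 0.5 * sum_v |p(v) - q(v)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def sample_categorical(weights, rng):
    """Draw an index with probability proportional to non-negative weights."""
    u = rng.random() * sum(weights)
    acc = 0.0
    last = 0
    for i, w in enumerate(weights):
        if w <= 0.0:
            continue
        acc += w
        last = i
        if u < acc:
            return i
    return last  # guard against floating-point rounding

def maximal_coupling_step(x_prev, q_prev, q_new, rng):
    """Given x_prev ~ q_prev, return y ~ q_new with Pr[y == x_prev] = 1 - TV.

    Hypothetical helper: keeps the old token with probability
    min(1, q_new(x) / q_prev(x)), otherwise draws from the normalized
    residual max(q_new - q_prev, 0).
    """
    if rng.random() * q_prev[x_prev] < q_new[x_prev]:
        return x_prev
    residual = [max(b - a, 0.0) for a, b in zip(q_prev, q_new)]
    return sample_categorical(residual, rng)

# Toy draft distributions from two consecutive iterations (assumed values).
rng = random.Random(0)
q_prev = [0.5, 0.3, 0.2]
q_new = [0.4, 0.4, 0.2]

n = 20000
same = 0
counts = [0] * len(q_new)
for _ in range(n):
    x = sample_categorical(q_prev, rng)
    y = maximal_coupling_step(x, q_prev, q_new, rng)
    same += int(x == y)
    counts[y] += 1

collision_rate = same / n            # should be close to 1 - TV
marginal = [c / n for c in counts]   # should be close to q_new
tv = total_variation(q_prev, q_new)
```

For these toy distributions the total variation distance is 0.1, so the coupled draws should agree roughly 90% of the time, whereas two independent draws would agree with probability $\sum_v q_{prev}(v)\,q_{new}(v) = 0.36$.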

Paper Structure

This paper contains 24 sections, 5 theorems, 19 equations, 11 figures, 3 tables, 6 algorithms.

Key Result

Proposition 1

Let $q$ be the draft distribution and $x\sim q(x)$. Then the final output of $\texttt{MRS}$ (Alg. 3) strictly follows the target model's distribution $p(x)$. Moreover, the acceptance rate of this algorithm is $\beta = \sum_v \min\big(p(v), q(v)\big) = 1 - \mathcal{D}_{TV}(p, q)$, where $\mathcal{D}_{TV}(p, q)$ denotes the total variation distance $\frac{1}{2}\sum_v|p(v)-q(v)|$.
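The proposition can be checked numerically on a small vocabulary. Alg. 3 is not reproduced in this summary, so the sketch below assumes $\texttt{MRS}$ is the standard modified rejection sampling used in speculative decoding: accept $x\sim q$ with probability $\min(1, p(x)/q(x))$, otherwise resample from the normalized residual $\max(p-q, 0)$. The function name `mrs_output_distribution` and the toy distributions are hypothetical. Rather than sampling, the code computes the output distribution and acceptance rate exactly.

```python
def total_variation(p, q):
    """Total variation distance 0.5 * sum_v |p(v) - q(v)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mrs_output_distribution(p, q):
    """Exact output law and acceptance rate of modified rejection sampling.

    Assumed procedure: draw x ~ q, accept with prob min(1, p(x)/q(x)),
    otherwise resample from max(p - q, 0) normalized.
    """
    # Mass that reaches token v through acceptance: q(v) * min(1, p(v)/q(v)).
    accepted = [min(a, b) for a, b in zip(p, q)]
    acceptance_rate = sum(accepted)
    residual = [max(a - b, 0.0) for a, b in zip(p, q)]
    z = sum(residual)
    if z == 0.0:
        # p == q: every draft is accepted, nothing is resampled.
        return accepted, acceptance_rate
    reject_prob = 1.0 - acceptance_rate
    output = [m + reject_prob * r / z for m, r in zip(accepted, residual)]
    return output, acceptance_rate

# Toy target and draft distributions (assumed values).
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

output, beta = mrs_output_distribution(p, q)
tv = total_variation(p, q)
# output matches p up to rounding, and beta matches 1 - tv.
```

Here $\mathcal{D}_{TV}(p,q) = 0.4$, so the acceptance rate is 0.6 and the computed output distribution coincides with $p$, which is what losslessness means in this setting.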

Figures (11)

  • Figure 1: Comparison of recent SD methods for AR image generation. While recent works suffer from limited acceleration or sacrifice quality, our MC-SJD achieves up to $\sim$4$\times$ speedup over standard AR without any quality degradation.
  • Figure 2: Generation NFE vs. mean token difference during SJD with window size $L=64$. Samples generated with a smaller NFE tend to have a smaller mean token difference.
  • Figure 3: (a), (b) Trajectories of the tokenwise acceptance rate $\beta^t_i$ during Jacobi iterations. (a) Under standard SJD, most tokens vary widely across iterations and show no improvement. (b) With our coupled sampler $\pi_{MC}$, most tokens fluctuate very little and show a general upward trend. (c) Mean and variance of $\beta^t_i$ across all token indices. While standard SJD does not improve, ours shows a clear upward, refining behavior.
  • Figure 4: Visualization of collision probabilities. (a) During standard SJD, the values of $C_{SJD}$ are concentrated on extremely small values. (b) Our coupler raises them to much higher values, significantly enhancing context similarity. (c) Standard SJD has a low $\Pr[X=Y]$ even when the corresponding TV distance is low. The green dotted line denotes the $\pi_{GS}$ lower bound $\pi_{GS}\ge(1-\mathcal{D}_{TV})/(1+\mathcal{D}_{TV})$.
  • Figure 5: Qualitative comparison between ours and AR on Lumina-mGPT (zoom in to view).
  • ...and 6 more figures
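The collision probabilities discussed in the Figure 4 caption can be compared on a toy example. The sketch below is an illustration under assumptions: it takes "collision probability" to mean $\Pr[X=Y]$ for $X\sim p$, $Y\sim q$, and it assumes $\pi_{GS}$ refers to a Gumbel-max sampler that shares its noise between the two draws. For that sampler it uses the closed form $\Pr[X=Y=i] = 1/\sum_j \max(p_j/p_i,\, q_j/q_i)$, which follows from the exponential-race view of the Gumbel-max trick. All function names are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance 0.5 * sum_v |p(v) - q(v)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def independent_collision(p, q):
    """Pr[X = Y] when X ~ p and Y ~ q are drawn independently."""
    return sum(a * b for a, b in zip(p, q))

def maximal_collision(p, q):
    """Pr[X = Y] under a maximal coupling: 1 - TV(p, q)."""
    return 1.0 - total_variation(p, q)

def gumbel_collision(p, q):
    """Pr[X = Y] when both argmaxes share the same Gumbel noise.

    Assumed closed form: Pr[X = Y = i] = 1 / sum_j max(p_j/p_i, q_j/q_i),
    summed over tokens i that have positive mass under both p and q.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi <= 0.0 or qi <= 0.0:
            continue
        denom = sum(max(pj / pi, qj / qi) for pj, qj in zip(p, q))
        total += 1.0 / denom
    return total

# Toy distributions standing in for two consecutive draft distributions.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

tv = total_variation(p, q)
c_indep = independent_collision(p, q)
c_gumbel = gumbel_collision(p, q)
c_max = maximal_collision(p, q)
lower_bound = (1.0 - tv) / (1.0 + tv)  # bound quoted in the Figure 4 caption
```

On this example independent sampling collides with probability 0.36, while the shared-noise Gumbel value lies between the quoted lower bound $(1-\mathcal{D}_{TV})/(1+\mathcal{D}_{TV})$ and the maximal-coupling value $1-\mathcal{D}_{TV}$, which mirrors the qualitative gap the caption describes.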

Theorems & Definitions (7)

  • Proposition 1
  • Proposition 2: SJD Collision Probability
  • Definition 1: Coupling
  • Theorem 1
  • Definition 2: Coupling Cost
  • Theorem 2
  • Theorem 3