VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia, Zhi Wang

TL;DR

VVS introduces partial verification skipping into speculative decoding for visual autoregressive generation, explicitly reducing target-model forward passes and thereby lowering inference latency. By identifying verification redundancy in candidate token trees and the reusable nature of stale intermediate features, VVS combines a verification-free token selector, token-level feature caching, and a fine-grained skipped-step scheduler to maintain generation quality. Empirical results show up to $2.86\times$ fewer forward passes and up to $1.76\times$ wall-clock speedup with minimal quality degradation, outperforming traditional SD approaches in speed-quality trade-offs. This work offers a practical acceleration pathway for visual AR models and suggests further refinements in draft-model training to maximize benefits from partial verification skipping.

Abstract

Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction in the number of forward passes, restricting its acceleration potential. Motivated by the interchangeability of visual tokens, we are the first to explore verification skipping in the SD process of visual AR generation, explicitly cutting the number of target-model forward passes and thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale feature reusability are key to retaining generation quality and speedup in verification-free steps. Inspired by these two observations, we propose VVS, a novel SD framework that accelerates visual AR generation via partial verification skipping and integrates three complementary modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped-step scheduling. Consequently, VVS reduces the number of target-model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
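
To make the forward-pass argument concrete, the toy counter below compares vanilla AR decoding, standard SD, and SD with some verification steps skipped. It is only an illustrative sketch, not the paper's scheduler: the function `count_target_passes`, the fixed accept length, and the fixed "skip every k-th step" schedule are all assumptions introduced here. It also assumes a skipped step emits as many tokens as a verified one, which the paper does not claim.

```python
def count_target_passes(num_tokens, accept_len, skip_every=0):
    """Count target-model forward passes needed to emit num_tokens tokens.

    accept_len: tokens emitted per draft step (a fixed toy value; real
                accept lengths vary from step to step).
    skip_every: hypothetical schedule; if k > 0, every k-th draft step is
                accepted without verification. 0 means verify every step.
    """
    if accept_len < 1:
        raise ValueError("accept_len must be at least 1")
    emitted = 0
    step = 0
    passes = 0
    while emitted < num_tokens:
        step += 1
        emitted += accept_len
        # A verified step costs one target-model forward pass;
        # a skipped step costs none in this toy model.
        verify = not (skip_every > 0 and step % skip_every == 0)
        if verify:
            passes += 1
    return passes


# Vanilla AR: one token per target forward pass.
vanilla = count_target_passes(1024, accept_len=1)
# Standard SD: every draft step is verified.
standard_sd = count_target_passes(1024, accept_len=2)
# Partial verification skipping: every second step is verification-free.
skipping_sd = count_target_passes(1024, accept_len=2, skip_every=2)
print(vanilla, standard_sd, skipping_sd)  # 1024 512 256
```

In this toy setting, standard SD can lower the pass count only by raising the accept length, whereas skipping lowers the number of verification steps directly.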

Paper Structure

This paper contains 28 sections, 3 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of the VVS framework. VVS explicitly reduces the number of target-model forward passes by bypassing some verification stages, thereby cutting inference latency during SD. $D_t$ denotes the draft stage at iteration $t$, and $V_t$ denotes the verification stage at iteration $t$.
  • Figure 2: Similarity of the drafted candidate token tree. (a) Visual similarity among different token paths within a candidate token tree. (b) Average similarity distribution of token trees.
  • Figure 3: Token-level feature similarity across different staleness during SD generation. As staleness increases, the similarity between stale features and fresh features decreases progressively.
  • Figure 4: Mean accept length comparison under feature blending with different additional staleness. $s$ denotes the extra staleness introduced for cached features; $s=-1$ indicates using the freshest features; $s=0$ indicates using the most recent features cached from prior steps; $s=i~(i>0)$ indicates using features with additional staleness $i$ compared with $s=0$.
  • Figure 5: (a) Inference pipeline of our SD framework VVS, which supports partial verification skipping. (b) Token-level feature caching and reuse mechanism. Since the number of tokens accepted varies across iterations and the truncation in Section 4.2 is applied, the cached features to be reused may come from multiple steps. Tokens accepted without verification proceed to the target model at the next verification step, which we term post verification, and their resulting features are cached back to the corresponding positions.
  • ...and 5 more figures
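
The caching behaviour described in the Figure 5(b) caption can be sketched as a per-position store that records when each feature was written. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `TokenFeatureCache`, its methods, and the definition of staleness as a difference in step indices are all hypothetical and do not reproduce the $s$ indexing used in Figure 4.

```python
class TokenFeatureCache:
    """Toy token-level feature cache (hypothetical, for illustration only)."""

    def __init__(self):
        # position -> (feature vector, step index at which it was cached)
        self._store = {}

    def write(self, position, feature, step):
        """Cache (or overwrite) the feature for one token position."""
        self._store[position] = (feature, step)

    def read(self, position, current_step):
        """Return the cached feature and its staleness in steps."""
        feature, cached_step = self._store[position]
        return feature, current_step - cached_step


cache = TokenFeatureCache()

# Step 3 (verified): features for positions 0..2 come from the target model.
for pos in range(3):
    cache.write(pos, [0.1 * pos] * 4, step=3)

# Step 5 (verification skipped): reuse what is cached; it is now 2 steps stale.
_, staleness_before = cache.read(1, current_step=5)

# Post verification at step 5: the target model produces a fresh feature,
# which is cached back to the same position.
cache.write(1, [0.5] * 4, step=5)
fresh_feature, staleness_after = cache.read(1, current_step=5)

print(staleness_before, staleness_after)  # 2 0
```

Keying the store by token position lets features written at different steps coexist, which matches the caption's remark that reused features may come from multiple steps.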