HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

Zhinan Xie, Peisong Wang, Shuang Qiu, Jian Cheng

TL;DR

HiViS addresses the inefficiency of speculative decoding in vision–language models by removing explicit visual tokens from the drafter's input. Instead, the target VLM acts as a semantic fusion module and supplies visual semantics through visual‑injected text embeddings. A time‑step‑aware residual training scheme lets the drafter draft autonomously while progressively aligning it with the target's multimodal semantics. Across multiple VLMs and benchmarks, HiViS achieves speedups of up to 3.15× and higher average acceptance lengths while preserving the target model's output distribution, and ablations support the design choices. The approach reduces the computational cost of multimodal inference and opens the way to more lightweight drafters without sacrificing accuracy or fidelity.

Abstract

Speculative decoding has proven effective for accelerating inference in Large Language Models (LLMs), yet its extension to Vision-Language Models (VLMs) remains limited by the computational burden and semantic inconsistency introduced by visual tokens. Recent studies reveal that visual tokens in large VLMs are highly redundant, and most of them can be removed without compromising generation quality. Motivated by this observation, we propose HiViS (Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models), a framework that utilizes the target VLM as a semantic fusion model, allowing the drafter to obtain visual information without explicitly processing visual tokens, ensuring that the drafter's prefill sequence length matches that of the textual tokens. Furthermore, HiViS employs a time-step-aware aligned training scheme that allows the drafter to autonomously propagate and refine instructive visual-textual semantics during independent drafting, guided by step-dependent bias-correction residuals. Extensive experiments across representative VLMs and benchmarks demonstrate that HiViS achieves significant improvements in average acceptance length and speedup ratio.
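
The claim that the drafter's prefill length matches the number of textual tokens can be illustrated with simple token counting. The sketch below is only an illustration: the function name and the token counts (576 visual tokens, 32 text tokens) are hypothetical values chosen for the example, not figures reported in the paper.

```python
def drafter_prefill_length(num_visual_tokens: int,
                           num_text_tokens: int,
                           hide_visual: bool) -> int:
    """Number of positions the drafter processes during prefill.

    hide_visual=False: the drafter sees visual and textual tokens.
    hide_visual=True:  the drafter sees only text positions, as described
                       in the abstract (visual information is assumed to
                       arrive through the embeddings, not extra positions).
    """
    if hide_visual:
        return num_text_tokens
    return num_visual_tokens + num_text_tokens


# Hypothetical prompt sizes, chosen only for illustration.
naive_len = drafter_prefill_length(576, 32, hide_visual=False)
hidden_len = drafter_prefill_length(576, 32, hide_visual=True)
print(naive_len, hidden_len)
```

Under these assumed sizes the drafter prefills 32 positions instead of 608; the actual saving depends on how many visual tokens the target VLM's encoder produces per image.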

Paper Structure

This paper contains 23 sections, 6 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Input modalities for drafting. (a) Naive: feeds both visual and textual tokens directly to the drafter. (b) Extensions: an additional module pre-processes the visual and textual tokens before drafting. (c) HiViS: removes visual tokens and uses a semantic fusion module to enrich textual tokens with visual information.
  • Figure 2: Visualization of hidden state magnitude distributions across layers. The gray divider marks the boundary between visual and text tokens.
  • Figure 3: Visualization of vision attention across layers and tokens. (a) Mean attention from post-visual tokens to visual tokens across layers. The gray plane separates the instruction region from the generated text. (b) Attention variation curves obtained over layers. (c) Attention variation curves obtained over tokens.
  • Figure 4: Overall framework of HiViS during the draft-verify process. The target VLM first performs the multimodal prefill to generate fused features $f_{\text{fused}}$ from both visual embeddings $e_{\text{visual}}$ and textual embeddings $e_{\text{text}}$. These fused features are used to construct the visual-injected text embeddings $\hat{e}_{\text{text}}$ that serve as inputs to the drafter. During the draft stage, the drafter generates multiple candidates, with each step refined by a step-dependent bias-correction residual $r$. In the verify stage, the target VLM evaluates all candidates in parallel and accepts the longest consecutive sequence of correct tokens.
  • Figure 5: Training architecture of HiViS. $e$ denotes the input embeddings, $f_{\text{fused}}$ and $x$ are the fused features and tokens from the target VLM, $f'$ denotes the drafter-predicted hidden states, and $x'^k$ denotes the corresponding top-k token candidates drawn from the drafter's output distribution. Correct predictions are marked with $\checkmark$, errors with $\times$, and gray tokens denote discarded candidates.
  • ...and 2 more figures
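
The verify stage described in the Figure 4 caption can be sketched with the standard greedy acceptance rule used in speculative decoding. This is a minimal sketch under stated assumptions, not the authors' implementation: it handles a single candidate chain rather than multiple candidates, it assumes greedy decoding, and the function name `accept_longest_prefix` and the token ids are hypothetical.

```python
from typing import List


def accept_longest_prefix(draft_tokens: List[int],
                          target_tokens: List[int]) -> List[int]:
    """Greedy verification for one chain of drafted tokens.

    target_tokens[i] is assumed to be the target model's greedy choice at
    position i, computed in one parallel pass over the prefix followed by
    draft_tokens[:i]. It must hold len(draft_tokens) + 1 entries.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    accepted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break  # first mismatch ends the accepted run
        accepted.append(tok)

    # Append the target's own token: a correction at the first mismatch,
    # or a bonus token when every drafted token was accepted.
    accepted.append(target_tokens[len(accepted)])
    return accepted


# Hypothetical token ids: the third drafted token disagrees with the target.
partial = accept_longest_prefix([5, 7, 9, 2], [5, 7, 3, 8, 1])
# All drafted tokens agree, so the target's extra token is appended.
full = accept_longest_prefix([5, 7], [5, 7, 4])
print(partial, full)
```

Because every emitted token is one the target model would have chosen itself, the output of this rule matches plain greedy decoding with the target. The number of tokens returned per call is the quantity that an average acceptance length averages over draft-verify cycles, under the convention that counts the target's appended token.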