HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

Zhinan Xie; Peisong Wang; Shuang Qiu; Jian Cheng

TL;DR

HiViS addresses the inefficiency of speculative decoding in vision-language models by removing explicit visual tokens from the drafter and using the target VLM as a semantic fusion module that supplies visual semantics through visual-injected text embeddings. A time-step-aware residual training scheme lets the drafter draft autonomously while progressively aligning it with the target's multimodal semantics. Across multiple VLMs and benchmarks, HiViS delivers speedups of up to 3.15× and higher average acceptance lengths while preserving the target distribution, and ablations validate the design choices. The approach reduces the computational burden of multimodal inference and paves the way for more lightweight drafters without sacrificing accuracy or fidelity.

Abstract

Speculative decoding has proven effective for accelerating inference in Large Language Models (LLMs), yet its extension to Vision-Language Models (VLMs) remains limited by the computational burden and semantic inconsistency introduced by visual tokens. Recent studies reveal that visual tokens in large VLMs are highly redundant, and most of them can be removed without compromising generation quality. Motivated by this observation, we propose HiViS (Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models), a framework that uses the target VLM as a semantic fusion model so that the drafter obtains visual information without explicitly processing visual tokens. As a result, the drafter's prefill sequence length matches that of the textual tokens alone. Furthermore, HiViS employs a time-step-aware aligned training scheme that allows the drafter to autonomously propagate and refine instructive visual-textual semantics during independent drafting, guided by step-dependent bias-correction residuals. Extensive experiments across representative VLMs and benchmarks demonstrate that HiViS achieves significant improvements in average acceptance length and speedup ratio.
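
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of the generic draft-then-verify loop under greedy decoding. It is not the HiViS drafter or its training scheme: the toy next-token functions, the function names (`speculative_step`, `generate`, `toy_target`, `toy_drafter`), and the draft length `k` are all hypothetical and chosen only for illustration. The sketch shows why the output matches target-only decoding and how average acceptance length (tokens emitted per verification round) can be measured.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round with greedy decoding (illustrative only)."""
    # The drafter proposes k tokens autoregressively.
    ctx = list(prefix)
    draft: List[Token] = []
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # The target checks each drafted token. A real system scores all k
    # positions in one parallel forward pass; this loop is sequential
    # only to keep the sketch short.
    ctx = list(prefix)
    emitted: List[Token] = []
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            emitted.append(expected)  # target's correction ends the round
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target_next(ctx))  # bonus token when every draft is accepted
    return emitted


def generate(prompt: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int,
             max_new: int) -> Tuple[List[Token], float]:
    """Generate at least max_new tokens; return them and tokens per round."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    new_tokens = out[len(prompt):]
    return new_tokens, len(new_tokens) / rounds


# Hypothetical toy models: the target counts upward modulo 10, and the
# drafter agrees with it except after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Calling `generate([0], toy_drafter, toy_target, k=3, max_new=12)` returns the generated tokens together with the average number of tokens emitted per round. Because every emitted token is either verified or supplied by the target, the sequence is identical to what the target would produce on its own; a better-aligned drafter only raises the tokens-per-round figure.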
