ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

TL;DR

Vision-Aware Speculative Decoding (ViSpec) tackles the latency of vision-language models by addressing image-token redundancy and lost multimodal coherence during autoregressive generation. It introduces a lightweight vision adaptor to compress image patches into a small set of tokens and a global visual feature injected into all subsequent text tokens, enabling a small draft model to predict effectively while the target model verifies predictions. A synthetic long-response training regime mitigates shortcut learning and aligns draft outputs with the target model, using a loss that matches draft and target token distributions. Empirically, ViSpec achieves substantial speedups across multiple VLMs and tasks (up to about 3.22×) and outperforms baselines like Medusa and EAGLE-2, marking the first meaningful acceleration of VLM inference via speculative decoding and offering a path toward real-time multimodal generation.

Abstract

Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
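
The draft-and-verify loop that the abstract builds on can be illustrated with a minimal sketch of greedy (temperature 0) speculative decoding. This is not the ViSpec implementation: `draft_next`, `target_next`, and the toy integer "models" are hypothetical stand-ins, and a real system verifies all proposed tokens in one batched forward pass of the target model rather than in a Python loop.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-and-verify step and return the newly accepted tokens."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2) The target model checks each proposed position in order.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return the number of verification steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[:len(prompt) + max_new], steps


# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees except right after the token 4.
def target_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def draft_next(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


tokens, steps = generate([3], draft_next, target_next, k=4, max_new=5)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone; the speedup comes from how many draft tokens survive each verification step, which is the quantity a better draft model improves.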

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Speedup ratios of various methods at temperature = 0, evaluated on the GQA test set using four VLMs: LLaVA-v1.6-Vicuna-7B, LLaVA-v1.6-Vicuna-13B, Qwen2.5-VL-3B-Instruct, and Qwen2.5-VL-7B-Instruct.
  • Figure 2: Overview of the ViSpec framework. Given an input image and text prompt, ViSpec compresses image tokens using a lightweight vision adaptor to produce a small set of visual tokens. These tokens are prepended to the text input and fed into the draft model's attention mechanism. A global visual feature vector, extracted from the compressed image tokens, is injected into the draft model's text generation process. The figure illustrates two decoding steps of the draft model, where $f$ denotes the target model's last-layer hidden state, $f^\prime$ the draft model's last-layer hidden state, $v$ visual embeddings, $e$ text embeddings, $c$ compressed image tokens, and $g$ the global visual feature vector.
  • Figure 3: Architecture of the vision adaptor module. A compact Transformer encoder with fixed learnable query vectors $q$ processes input visual embeddings $v$ through an attention layer, yielding a small set of compressed image tokens $c$ and a single global visual feature $g$.
  • Figure 4: Comparison of training procedures: (a) EAGLE training, (b) training with greedy target model responses without multi-token prediction, and (c) ViSpec training. Here, $e$ denotes input embeddings, $f$ represents target model hidden states, $\hat{f}$ indicates EAGLE draft model hidden states, $f^\prime$ denotes ViSpec draft model hidden states, and $p$ signifies token probabilities.
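
The compression described in the Figure 3 caption (learnable queries $q$ attending over visual embeddings $v$ to produce compressed tokens $c$ and a global feature $g$) can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the paper's module: it uses a single attention head with no learned projections, random values in place of trained queries, hypothetical sizes (576 patches, 4 compressed tokens, dimension 8), and it takes $g$ from one extra query, which may differ from how ViSpec actually derives the global feature.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], keys_values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention where keys and values are the same vectors."""
    dim = len(queries[0])
    outputs: List[Vector] = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
                  for kv in keys_values]
        weights = softmax(scores)
        # Each output is a weighted average of the visual embeddings.
        outputs.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                        for j in range(dim)])
    return outputs


def vision_adaptor(visual_embeddings: List[Vector],
                   queries: List[Vector]) -> Tuple[List[Vector], Vector]:
    """Return (compressed tokens c, global feature g).

    Assumption: the last query is reserved for the global feature g.
    """
    outputs = cross_attend(queries, visual_embeddings)
    return outputs[:-1], outputs[-1]


# Hypothetical sizes and random stand-ins for trained parameters.
random.seed(0)
dim, n_patches, n_compressed = 8, 576, 4
v = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_patches)]
q = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_compressed + 1)]

c, g = vision_adaptor(v, q)
```

In this sketch the draft model would attend over the 4 vectors in `c` instead of all 576 patch embeddings, while `g` is the single vector that would be added to subsequent text-token inputs.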