MMSpec: Benchmarking Speculative Decoding for Vision-Language Models

Hui Shen; Xin Wang; Ping Zhang; Yunta Hsieh; Qi Han; Zhongwei Wan; Ziheng Zhang; Jingxuan Zhang; Jing Xiong; Ziyuan Liu; Yifan Zhang; Hangrui Cao; Chenyang Zhao; Mi Zhang

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.

Paper Structure (25 sections, 4 equations, 6 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Efficient Vision-Language Models
Speculative Decoding
The MMSpec Benchmark
Dataset Construction
Speculative Decoding Algorithms
Training-based Methods
Training-free Methods
Experiments
Experiment Setup
Overall Comparison
Sensitivity Study
Latency Analysis
ViSkip
...and 10 more sections

Figures (6)

Figure 1: Performance comparison of speculative decoding methods with Qwen2.5-VL-7B on the MMSpec benchmark.
Figure 2: Sampled data from MMSpec.
Figure 3: Overview of speculative decoding algorithms evaluated in the MMSpec framework.
Figure 4: Speedup comparison of speculative decoding methods across different batch sizes. The speedup is measured relative to autoregressive decoding.
Figure 5: CDF of per-sample latency for different speculative decoding methods.
...and 1 more figure
