Speculative Decoding Reimagined for Multimodal Large Language Models

Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Rongrong Ji

TL;DR

This work tackles the bottleneck of inference latency in multimodal large language models by introducing Multimodal Speculative Decoding (MSD). MSD hinges on two design principles: decoupling text and visual token processing in the draft model and a two-stage training regime that first strengthens language modeling and then gradually incorporates multimodal data. Empirical results across diverse multimodal benchmarks demonstrate that MSD achieves stronger average acceptance lengths and higher speedup ratios than prior methods, approaching the performance of LLM speculative decoding in practice. The approach offers a practical, scalable path toward real-world deployment of MLLMs with reduced latency and maintained accuracy.

Abstract

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: In stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability. In stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to $2.29\times$ for LLaVA-1.5-7B and up to $2.46\times$ for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at https://github.com/Lyn-Lucy/MSD.
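The two-stage training strategy described in the abstract can be pictured as a data-mixing schedule. The sketch below is illustrative only: the abstract says that multimodal data is introduced "gradually" in stage two but does not give the schedule here, so the linear ramp, the function names `multimodal_fraction` and `batch_composition`, and all parameters are assumptions rather than the authors' implementation.

```python
def multimodal_fraction(step: int, stage1_steps: int, total_steps: int,
                        max_fraction: float = 1.0) -> float:
    """Fraction of each batch drawn from multimodal data at a given step.

    Assumption: stage one is text-only (fraction 0.0); stage two ramps the
    fraction linearly up to max_fraction. The linear shape is hypothetical.
    """
    if total_steps <= stage1_steps:
        raise ValueError("total_steps must exceed stage1_steps")
    if step < stage1_steps:
        return 0.0  # stage one: text-only instruction-tuning data
    progress = (step - stage1_steps) / (total_steps - stage1_steps)
    return max_fraction * min(1.0, progress)


def batch_composition(step: int, stage1_steps: int, total_steps: int,
                      batch_size: int) -> tuple:
    """Return (num_text_examples, num_multimodal_examples) for one batch."""
    fraction = multimodal_fraction(step, stage1_steps, total_steps)
    num_multimodal = round(fraction * batch_size)
    return batch_size - num_multimodal, num_multimodal


if __name__ == "__main__":
    # Print the batch mix at a few points of a hypothetical 200-step run.
    for step in (0, 100, 150, 200):
        print(step, batch_composition(step, stage1_steps=100,
                                      total_steps=200, batch_size=8))
```

Under these assumed settings, every batch is text-only for the first 100 steps, after which the share of multimodal examples grows until the batch is fully multimodal at the final step. Other ramp shapes (stepwise, cosine) would fit the description in the abstract equally well.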

Paper Structure

This paper contains 35 sections, 12 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of models trained on text-only and vision instruction data, using LLaVA-1.5-7B as the target model. Black tokens are judged correct; red and blue tokens mark visual-related and unrelated errors, respectively.
  • Figure 2: Input illustration of MLLMs. The input to MLLMs comprises two distinct token types: text and image tokens.
  • Figure 3: The framework of Multimodal Speculative Decoding (MSD). The left side illustrates the draft phase, and the right side illustrates the training phase. $e$ represents token embeddings, and $f$ denotes the concatenated features. Red-bordered tokens are predicted by the draft model. During drafting, MSD concatenates text token hidden states with the next-token embeddings, while visual token embeddings are input directly without concatenation. MSD trains the draft model on a text-only instruction-tuning dataset in Stage 1, and then gradually introduces a multimodal instruction-tuning dataset in Stage 2.
  • Figure 4: Comparison of acceptance rates between Baseline and MSD across different temperatures on a randomly sampled subset of 100 samples from TextVQA.
  • Figure 5: Impact of draft model training strategies, evaluated on a randomly sampled 100-example subset of ChartQA.
  • ...and 1 more figure
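
The decoupled handling of text and visual tokens described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `project` and `build_draft_inputs`, the per-position argument layout, and the linear projection that maps the concatenated text feature back to the embedding width are all hypothetical. Only the branching rule (concatenate for text tokens, pass the embedding through for visual tokens) comes from the caption.

```python
from typing import List, Sequence

Vector = List[float]


def project(x: Sequence[float], weight: Sequence[Sequence[float]]) -> Vector:
    """Multiply a row vector x (length n) by a weight matrix (n rows, d columns)."""
    num_cols = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(num_cols)]


def build_draft_inputs(hidden_states: Sequence[Sequence[float]],
                       next_embeddings: Sequence[Sequence[float]],
                       own_embeddings: Sequence[Sequence[float]],
                       is_visual: Sequence[bool],
                       weight: Sequence[Sequence[float]]) -> List[Vector]:
    """Build one draft-model input vector per position.

    Text position: concatenate its hidden state with the next-token
    embedding, then project from width 2d to d (the projection is an
    assumption made for this sketch).
    Visual position: use the token's own embedding directly, with no
    concatenation.
    """
    inputs = []
    for hidden, e_next, e_own, visual in zip(hidden_states, next_embeddings,
                                             own_embeddings, is_visual):
        if visual:
            inputs.append(list(e_own))
        else:
            inputs.append(project(list(hidden) + list(e_next), weight))
    return inputs


if __name__ == "__main__":
    # Toy example with embedding width d = 2: one visual and one text position.
    # Two stacked identity matrices, so the projection adds the two halves.
    weight = [[1, 0], [0, 1], [1, 0], [0, 1]]
    print(build_draft_inputs(
        hidden_states=[[1, 2], [3, 4]],
        next_embeddings=[[10, 20], [30, 40]],
        own_embeddings=[[0.5, 0.5], [0.7, 0.7]],
        is_visual=[True, False],
        weight=weight,
    ))
```

In the toy example, the visual position passes its embedding through unchanged, while the text position is built from both its hidden state and the next-token embedding.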