ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding

Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang

Abstract

Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by $1.6\sim1.8\times$ with high accepted lengths, and accelerates various video understanding benchmarks by $3.36\times$ on LLaVA-Onevision-72B and $2.42\times$ on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.
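To make the "draft-then-verify" and "lossless" terms concrete, here is a minimal sketch of vanilla greedy speculative decoding. It is not the ParallelVLM pipeline: it runs sequentially, uses no video tokens or pruning, and the `NextToken` interface and the two toy models are hypothetical stand-ins invented for illustration.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def autoregressive_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, gamma: int) -> List[int]:
    """Vanilla greedy draft-then-verify with a draft window of gamma tokens."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft stage: the cheap model proposes gamma tokens autoregressively.
        proposal: List[int] = []
        for _ in range(gamma):
            proposal.append(draft(out + proposal))
        # Verify stage: keep the longest prefix the target agrees with.
        # (A real system scores all positions in one batched target pass.)
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: use the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # bonus token if all accepted
        out.extend(accepted)
    return out[:len(prompt) + max_new]


# Toy models (assumptions for illustration only).
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except at every fourth position.
    return toy_target(ctx) if len(ctx) % 4 else (toy_target(ctx) + 1) % 50


prompt = [1, 2, 3]
spec = speculative_decode(toy_target, toy_draft, prompt, max_new=20, gamma=4)
base = autoregressive_decode(toy_target, prompt, max_new=20)
print(spec == base)
```

Every emitted token is either confirmed or supplied by the target model, so in this greedy setting the output matches plain autoregressive decoding token for token. The draft model only affects how many tokens are accepted per round, not what is generated.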

Paper Structure (20 sections, 6 theorems, 22 equations, 11 figures, 7 tables, 1 algorithm)

Key Result

Theorem 1

Denote the video token pruning method as $\mathcal{P}$ and the pruning ratio as $\alpha \in [0,1]$. ParallelVLM overlaps the draft and verification stages, corresponding to an enlarged draft/target speed ratio $c^{*}(\alpha)>c$, and a robust average acceptance rate $\hat{\tau}(\mathcal{P},\alpha)$.
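For context, the standard analysis of vanilla speculative decoding shows why both quantities matter. The following is a sketch under stated assumptions, not the paper's Theorem 1: each drafted token is assumed to be accepted independently with probability $\tau$, the draft window is $\gamma$ tokens, and $c$ is read as $T_p/T_q$, where $T_q$ is the draft model's per-token decoding time and $T_p$ (a symbol introduced here for illustration) is the target model's time for one forward pass. One round then costs $\gamma T_q + T_p$ and yields, in expectation,

$$
\mathbb{E}[\text{tokens per round}] = \frac{1-\tau^{\gamma+1}}{1-\tau},
\qquad
\text{Speedup}_{\text{vanilla}} = \frac{1-\tau^{\gamma+1}}{(1-\tau)\,(\gamma/c+1)}.
$$

As a worked example with $\tau = 0.9$, $\gamma = 5$, and $c = 10$: the expected tokens per round are $(1-0.9^{6})/0.1 \approx 4.69$, and the speedup is $4.69/1.5 \approx 3.12$. Under this reading, a larger speed ratio shrinks the $\gamma/c$ drafting cost and a higher acceptance rate raises the numerator, which is consistent with the theorem's emphasis on $c^{*}(\alpha)>c$ and a robust $\hat{\tau}(\mathcal{P},\alpha)$.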

Figures (11)

  • Figure 1: Comparison of ParallelVLM with vanilla autoregressive decoding and visual token pruning. ParallelVLM achieves up to $3.36\times$ acceleration for LLaVA-OV-72B and $2.42\times$ for Qwen2.5-VL-32B across various video understanding benchmarks.
  • Figure 2: Comparison of different training-free speculative decoding frameworks for Video-LLMs. (a) Vanilla SD: sequential in both prefilling and decoding stages, achieving speedup only with the draft-then-verify design. (b) SpecVLM [Ji2025SpecVLMES]: inherits vanilla SD and enables faster draft model execution via attention-guided pruning. (c) ParallelVLM: adopts a parallel pipeline to mitigate idling time in vanilla SD, co-designed with the UV-Prune strategy to expand the draft window size while preserving draft/target alignment.
  • Figure 3: Scaling of (a) Prefilling Latency and (b) Decoding Time with increasing number of video tokens. Experiments are conducted on LLaVA-OV 7B draft model and 72B target model.
  • Figure 4: Target model's attention guidance for draft video token pruning. We observe positional bias at both ends of the video token sequence. At a 10% retention rate, frames 1 and 125-128 accumulate up to 20.9% of all selected tokens while spanning only 4% of the positions.
  • Figure 5: The proposed ParallelVLM consists of Parallel Prefilling (PP), Parallel Decoding (PD) and the Unbiased Verifier-guided Pruning (UV-Prune). (a) During the PP stage, draft model prefilling and pruning execute in parallel with target model prefilling, where UV-Prune transfers salient alignment semantics from the target model to the draft model without "positional bias". (b) During the PD stage, the reduced draft decoding time $T_q$ enables an enlarged window size $\gamma$ in parallel with verification while maintaining high acceptance rates $\tau$.
  • ...and 6 more figures
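The positional-bias statistic in the Figure 4 caption can be illustrated with a small sketch. It shows how one might measure the share of top-k selected tokens that land in a few frames. The scores below are synthetic and deliberately boosted at both ends; they are not data from the paper. The value of `TOKENS_PER_FRAME` and the helper names are assumptions made for this example.

```python
from typing import Iterable, List, Sequence

NUM_FRAMES = 128          # implied by the caption's frame range 125-128
TOKENS_PER_FRAME = 4      # assumption, chosen only to keep the example small


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens (attention-guided selection)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def frame_share(selected: Sequence[int], frames: Iterable[int]) -> float:
    """Fraction of selected token indices that fall in the given 1-based frames."""
    wanted = set(frames)
    hits = sum(1 for idx in selected if idx // TOKENS_PER_FRAME + 1 in wanted)
    return hits / len(selected)


edge_frames = [1, 125, 126, 127, 128]

# Synthetic scores: a deterministic base in [0, 1), plus a boost of 1.0
# for tokens in the edge frames to mimic attention concentrating at both ends.
num_tokens = NUM_FRAMES * TOKENS_PER_FRAME
scores = [((i * 37) % 101) / 101 for i in range(num_tokens)]
for i in range(num_tokens):
    if i // TOKENS_PER_FRAME + 1 in edge_frames:
        scores[i] += 1.0

keep = int(0.10 * num_tokens)             # 10% retention rate
selected = top_k_indices(scores, keep)

position_width = len(edge_frames) / NUM_FRAMES
print(f"position width: {position_width:.3f}")
print(f"share of selected tokens: {frame_share(selected, edge_frames):.3f}")
```

Five of 128 frames cover about 3.9% of the positions, which matches the caption's "4%" figure. A selection share far above that width is the kind of imbalance the caption describes.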

Theorems & Definitions (8)

  • Theorem 1: Speedup Ratio of ParallelVLM
  • Theorem 2: Vanilla SD (Ideal)
  • Theorem 3: Parallel SD (Ideal)
  • Theorem 4: Vanilla SD (Practical)
  • Theorem 5: Parallel SD (Practical)
  • proof
  • Theorem 6: Parallel SD (with Rollback)
  • proof