Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference

Wengyi Zhan, Mingbao Lin, Zhihang Lin, Rongrong Ji

TL;DR

ParVTS tackles the latency bottleneck of multimodal LLMs caused by quadratic self-attention and abundant visual tokens by introducing a training-free vision token scheduling framework. It partitions visual tokens into subject and non-subject groups, runs parallel processing to migrate their semantics into the question tokens, and discards the non-subject path mid-inference to achieve substantial speedups and FLOP reductions without extra modules or training. The method leverages the natural visual-to-text information migration observed in early transformer layers and uses a parallel execution strategy with fusion weights to maintain representation quality. Across multiple backbones and benchmarks, ParVTS achieves up to 88.9% token pruning with minimal accuracy loss, up to 1.77x speedup, and about 70% FLOPs reduction, demonstrating practical, scalable efficiency gains for real-world multimodal reasoning tasks.
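The scheduling idea in the summary above can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the per-token `scores` (for example, [CLS] attention), the top-`keep` partition rule, the single scalar fusion weight `alpha`, and `toy_layer`, which stands in for a transformer layer by pulling the mean visual state into each question token.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def partition_tokens(scores: Sequence[float], keep: int) -> Tuple[List[int], List[int]]:
    """Split token indices into (subject, non_subject) by descending score.

    The scoring signal and the top-`keep` rule are hypothetical choices.
    """
    if not 0 < keep < len(scores):
        raise ValueError("keep must leave both groups non-empty")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep]), sorted(order[keep:])


def toy_layer(visual: Sequence[Vector], question: Sequence[Vector]) -> List[Vector]:
    """Stand-in for one early layer: question tokens absorb the mean visual state."""
    mean = [sum(col) / len(visual) for col in zip(*visual)]
    return [[0.5 * q + 0.5 * m for q, m in zip(q_vec, mean)] for q_vec in question]


def fuse(q_subject: Sequence[Vector], q_other: Sequence[Vector], alpha: float) -> List[Vector]:
    """Convex combination of the question states produced by the two paths."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(vec_a, vec_b)]
        for vec_a, vec_b in zip(q_subject, q_other)
    ]


def parallel_schedule(
    visual: Sequence[Vector],
    question: Sequence[Vector],
    scores: Sequence[float],
    keep: int,
    alpha: float,
    early_layers: int,
) -> Tuple[List[int], List[Vector]]:
    """Run both token groups through the early layers, fuse, then drop one path.

    Returns the indices of the visual tokens that stay (the subject group)
    and the fused question-token states used by the remaining layers.
    """
    subject, non_subject = partition_tokens(scores, keep)
    q_subject = [list(q) for q in question]
    q_other = [list(q) for q in question]
    for _ in range(early_layers):
        q_subject = toy_layer([visual[i] for i in subject], q_subject)
        q_other = toy_layer([visual[i] for i in non_subject], q_other)
    # After fusion the non-subject tokens are no longer needed.
    return subject, fuse(q_subject, q_other, alpha)


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
    scores = [0.9, 0.1, 0.8, 0.2]
    kept, fused = parallel_schedule(visual, [[0.0, 0.0]], scores, keep=2, alpha=0.5, early_layers=1)
    print(kept, fused)
```

In this toy setup the question token ends up carrying information from both groups even though only the subject tokens are kept for later layers, which is the property the parallel schedule is meant to preserve.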

Abstract

Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency, because self-attention scales quadratically with sequence length and high-resolution images contribute thousands of visual tokens. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with diverse existing MLLM architectures. Experiments across multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving a 1.77x speedup and a 70% FLOPs reduction.
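To make the pruning figure concrete, the short calculation below shows how a token budget maps to a pruned fraction and to the size of the attention score matrix. The specific counts are assumptions chosen for illustration and are not taken from the text above: 576 visual tokens (the count commonly associated with LLaVA-1.5 at 336-pixel input), 64 kept tokens, and 64 text tokens.

```python
def pruned_fraction(total_visual: int, kept_visual: int) -> float:
    """Fraction of visual tokens removed."""
    return 1.0 - kept_visual / total_visual


def attention_pairs(visual: int, text: int) -> int:
    """Number of query-key pairs in one full self-attention layer."""
    n = visual + text
    return n * n


if __name__ == "__main__":
    # Hypothetical token counts, see the assumptions stated above.
    total_visual, kept_visual, text = 576, 64, 64

    print(f"pruned: {pruned_fraction(total_visual, kept_visual):.1%}")  # pruned: 88.9%

    before = attention_pairs(total_visual, text)
    after = attention_pairs(kept_visual, text)
    print(f"attention pairs per layer: {before} -> {after} ({after / before:.0%})")
```

The per-layer ratio applies only to layers after the non-subject path is dropped, and only to the attention score term, so it is not expected to match the reported end-to-end speedup or FLOPs numbers.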

Paper Structure

This paper contains 25 sections, 25 equations, 4 figures, 24 tables.

Figures (4)

  • Figure 1: Subject-oriented question distribution and examples across VQA datasets [lu2022sqa, liu2024ocrbench, Singh_2019_CVPR_textvqa, ai2d]. Left: Percentage of questions focused on subject content, with annotation details in the appendix on subject annotation. Right: Visual examples contrasting subject-relevant (Q1) and non-subject (Q2) questions.
  • Figure 2: The framework of ParVTS. (a) Sequential vision token scheduling (taking non-subject-first as an example) injects token groups at different transformer layers, leading to a series of issues. (b) Our parallel scheduling enables both token groups to participate in early layers simultaneously, ensuring sufficient information migration and consistent representation with low inference cost.
  • Figure 3: ParVTS performance on the LISA [lai2024lisa] segmentation task.
  • Figure 4: Visualization of [CLS] token attention and qualitative results of ParVTS on LLaVA-1.5-7B.