RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

TL;DR

RedVTP is a training-free, response-driven visual token pruning method for diffusion vision-language models (DVLMs). It scores each visual token by the attention it receives from the still-masked response tokens in the first inference step, retains the top-$r$ proportion of visual tokens, and runs all subsequent steps on that reduced set. Across six benchmarks and two DVLMs (LLaDA-V and LaViDa), RedVTP reduces latency and raises throughput with minimal accuracy loss, and in some cases improves accuracy. The approach is orthogonal to KV-cache techniques and offers a practical path to deploying DVLMs with lower computational demands.

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive because they enable parallel token decoding, but the large number of visual tokens still significantly hinders their inference efficiency. While visual token pruning has been extensively studied for autoregressive VLMs (AVLMs), it remains largely unexplored for DVLMs. In this work, we propose RedVTP, a response-driven visual token pruning strategy that leverages the inference dynamics of DVLMs. Our method estimates visual token importance using attention from the masked response tokens. Based on the observation that these importance scores remain consistent across steps, RedVTP prunes the less important visual tokens after the first inference step, thereby maximizing inference efficiency. Experiments show that RedVTP improves token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, without compromising accuracy and in some cases improving it.

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of the average numbers of visual and prompt tokens on InfoVQA (Mathew et al., 2021) using (a) LLaDA-V and (b) LaViDa.
  • Figure 2: The framework of RedVTP on DVLMs. (a) The proposed DVLM process: we apply RedVTP to the original diffusion language model. (b) DVLM with RedVTP applied after the first inference step. (c) We collect the attention map from the first inference step, averaged over heads and layers, and for each visual token sum the attention it receives from all still-masked response tokens to obtain its importance score. Based on the importance scores, we retain the top-$r$ proportion of visual tokens for the remaining inference steps.
  • Figure 3: Visualization of visual token pruning results based on masked token-guided importance scores using examples from RealworldQA. The yellow circled regions indicate the areas that the correct responses should attend to. (a) under $r = 50\%$; (b) under $r = 25\%$. As can be observed, the regions essential for generating the correct responses are consistently retained under different settings.
  • Figure 4: The averaged masked token-guided importance score similarity between $S_1$ and each $S_k$ ($1 < k \leq 15$), computed using 20% of the samples from InfoVQA with LLaDA-V. $\text{Sim}_k$ denotes the cosine similarity between $S_k$ and $S_1$. From the curve, we can observe that all $\text{Sim}_k$ values are higher than 0.95.
  • Figure 5: Two examples from DocVQA and InfoVQA using LLaDA-V. The model answers the questions correctly even when only $25\%$ of the visual tokens are retained.
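
The scoring and selection steps described in the Figure 2 caption, and the similarity check in the Figure 4 caption, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the attention layout `attn[layer, head, query, key]`, and the rounding rule for the number of retained tokens (ceiling, with at least one token kept) are all hypothetical choices made for the example.

```python
import numpy as np


def masked_token_importance(attn, visual_idx, masked_idx):
    """Score each visual token by attention from still-masked response tokens.

    attn: array of shape (layers, heads, seq, seq), where attn[l, h, q, k] is
          the weight that query position q places on key position k
          (assumed layout, not taken from the paper).
    visual_idx: sequence positions of the visual tokens.
    masked_idx: sequence positions of the still-masked response tokens.
    """
    # Average the first-step attention map over layers and heads.
    avg = attn.mean(axis=(0, 1))  # shape (seq, seq)
    # Rows are masked response queries, columns are visual keys.
    sub = avg[np.ix_(masked_idx, visual_idx)]
    # Cumulative attention received by each visual token.
    return sub.sum(axis=0)


def select_top_r(scores, visual_idx, r):
    """Return the positions of the top-r proportion of visual tokens.

    The ceiling rounding and the minimum of one kept token are assumptions.
    """
    n_keep = max(1, int(np.ceil(r * len(visual_idx))))
    order = np.argsort(-scores, kind="stable")[:n_keep]
    # Restore the original sequence order of the retained tokens.
    return np.sort(np.asarray(visual_idx)[order])


def cosine_similarity(a, b):
    """Cosine similarity between two importance-score vectors (as in Sim_k)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In use, `masked_token_importance` would be called once on the attention from the first inference step, and the positions returned by `select_top_r` would determine which visual tokens are kept for the remaining steps. `cosine_similarity` applied to the step-1 and step-$k$ score vectors mirrors the $\text{Sim}_k$ quantity reported in Figure 4.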