A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

TL;DR

This work investigates visual token redundancy in discrete diffusion-based multimodal LLMs (dMLLMs), comparing from-scratch diffusion models with AR-to-diffusion adaptations. It reveals that visual redundancy mainly emerges in from-scratch dMLLMs during long-answer generation and that pruning visual tokens induces information loss, with restoration capabilities differing dramatically between backbones. The study shows that layer-skipping is effective for AR-to-diffusion models, while progressive or late-step pruning better serves from-scratch models, and it identifies attention scores and answer-token logits as reliable pruning signals. These findings reframe visual redundancy as a restoration-driven property rather than a simple token dispensability issue, guiding practical pruning strategies to balance efficiency and accuracy across diverse multimodal tasks. Overall, the results offer actionable insights for accelerating dMLLM inference without severely degrading performance, broadening their applicability in real-world multimodal understanding tasks.

Abstract

Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference because full-sequence attention is computed in every denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs when handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and that only from-scratch dMLLMs can progressively recover the lost information during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
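
To make the idea of late-step pruning concrete, the following is a minimal sketch and not the authors' implementation. It assumes each visual token already has an importance score (for example, an attention score from answer tokens), keeps the top-scoring fraction given by a retention ratio, and applies that reduced set only from a chosen denoising step onward. The function names `top_k_visual_tokens` and `late_step_schedule`, and the example scores, are hypothetical.

```python
def top_k_visual_tokens(scores, retention_ratio):
    """Return the sorted indices of the visual tokens to keep.

    scores: one importance score per visual token (assumed to be given).
    retention_ratio: fraction of visual tokens to retain, in (0, 1].
    """
    if not 0.0 < retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be in (0, 1]")
    # Keep at least one token so the model still sees some visual input.
    k = max(1, int(len(scores) * retention_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def late_step_schedule(num_steps, start_step, scores, retention_ratio):
    """List, for every denoising step, which visual token indices are used.

    Steps before start_step use all visual tokens; steps from start_step
    onward use only the retained subset (hypothetical schedule).
    """
    kept = top_k_visual_tokens(scores, retention_ratio)
    full = list(range(len(scores)))
    return [full if step < start_step else kept for step in range(num_steps)]


if __name__ == "__main__":
    example_scores = [0.1, 0.4, 0.05, 0.3]  # made-up scores for illustration
    for step, tokens in enumerate(late_step_schedule(4, 2, example_scores, 0.5)):
        print(step, tokens)
```

In this sketch the scores are computed once and the retained set is fixed; a progressive variant would instead lower the retention ratio step by step.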

Paper Structure

This paper contains 46 sections, 8 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Left: Visual token pruning affects dMLLMs much more than autoregressive MLLMs across retention ratios. The nine curves show model accuracy when three prevalent pruning methods (FastV [chen2024fastv], SparseVLM [zhang2024sparsevlm], and DivPrune [alvar2025divprune]) are applied to one representative autoregressive MLLM (LLaVA-NeXT [liu2024llavanext], the three solid curves at the top) and two representative dMLLMs (the AR-to-diffusion dMLLM LaViDa-Dream [li2025lavida] and the from-scratch dMLLM LLaDA-V [you2025lladav], the six dotted curves at the bottom). Right: Performance evolution when pruning starts at different denoising steps and continues until the last denoising step. From-scratch LLaDA-V recovers pruning-induced information loss and consistently achieves much higher accuracy than AR-to-diffusion LaViDa-Dream, regardless of the step at which pruning starts (denoising steps are projected to 0-100 for visualization).
  • Figure 2: Visualization of the fraction of attention from answer tokens to each token type across layers, and of logit dynamics across denoising steps. (a, d) show the heatmaps and attention ratio variations of LaViDa-Dream [li2025lavida] (representing AR-to-diffusion dMLLMs) across both short- and long-answer tasks. (b, e) correspond to LLaDA-V [you2025lladav] (representing from-scratch dMLLMs) on paragraph-level long-answer tasks (Video Detail Description [lmmslab2024videodetailcaption]). (c, f) present LLaDA-V on sentence-level long-answer tasks (InfoVQA [mathew2022infographicvqa] and DocVQA [mathew2021docvqa]). (g) shows the attention ratio trends of LLaDA-V on short-answer tasks, and (h) depicts those of MLLMs on both short- and long-answer tasks. We observe three key patterns: (1) Compared with MLLMs, both dMLLM variants rely more strongly on visual tokens, which leads to a more pronounced performance drop when visual tokens are pruned. (2) The self-attention intensity among answer tokens increases progressively from MLLMs to LaViDa-Dream to LLaDA-V, giving dMLLMs, especially LLaDA-V, a stronger capacity to recover lost information through bidirectional contextual refinement. (3) Steps 7–8 in (c) and (f) reveal a consistent pattern in which a rise in logits follows a surge in answer-token self-attention, suggesting that stronger self-attention facilitates information restoration during denoising.
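
The quantity plotted in Figure 2, the fraction of attention from answer tokens to each token type, can be sketched as follows. This is an illustrative computation under stated assumptions, not the authors' code: it assumes a single attention matrix (one layer, heads already averaged) given as nested lists, and a per-token type label. The labels "visual", "prompt", and "answer" and the function name `attention_fractions` are hypothetical.

```python
def attention_fractions(attn, token_types, answer_label="answer"):
    """Share of answer-token attention that goes to each token type.

    attn[q][k]: attention weight from query token q to key token k.
    token_types[i]: type label of token i (labels are assumptions).
    Returns a dict mapping each token type to its fraction; fractions sum to 1.
    """
    answer_rows = [q for q, t in enumerate(token_types) if t == answer_label]
    if not answer_rows:
        raise ValueError("no answer tokens found")
    totals = {}
    for q in answer_rows:
        for k, weight in enumerate(attn[q]):
            key_type = token_types[k]
            totals[key_type] = totals.get(key_type, 0.0) + weight
    grand_total = sum(totals.values())
    return {t: v / grand_total for t, v in totals.items()}


if __name__ == "__main__":
    types = ["visual", "visual", "prompt", "answer", "answer"]
    uniform = [0.2] * 5  # non-answer query rows are ignored by the function
    attn = [
        uniform,
        uniform,
        uniform,
        [0.2, 0.2, 0.1, 0.3, 0.2],  # first answer token
        [0.1, 0.3, 0.2, 0.2, 0.2],  # second answer token
    ]
    print(attention_fractions(attn, types))
```

Repeating this per layer would give one column of a heatmap like those in panels (a) to (c); the "answer" entry corresponds to the answer-token self-attention discussed in patterns (2) and (3).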