Fine-grained Token Allocation Via Operation Pruning for Efficient MLLMs

Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer

TL;DR

This work tackles the inefficiency of Multimodal Large Language Models (MLLMs) by exposing and exploiting fine-grained computational redundancy across decoder modules. It introduces Depth-wise Operation Pruning (DOP), a framework that decomposes decoder computation into atomic operations indexed by (group, layer, module) and combines depth-wise pruning with an additive divergence approximation to allocate tokens adaptively under a TFLOPs budget. The approach achieves state-of-the-art results across 6 MLLMs and 13 benchmarks; on LLaVA-Next-7B it reduces TFLOPs by 86% and real-GPU latency by 83% with about 1% performance loss. It also generalizes well across tasks and models while keeping optimization overhead low (as little as 2 minutes with limited samples). These results demonstrate the practicality of per-module token allocation for accelerating MLLMs in real-world, compute-constrained settings. The work provides reproducible configurations and public-ready code, paving the way for broader adoption of fine-grained, data-efficient pruning in multimodal decoding pipelines.

Abstract

Token reduction accelerates Multimodal Large Language Models (MLLMs) by discarding excess tokens, but it overlooks differences in structural redundancy: critical and redundant modules still process identical token loads. For fine-grained computation control, we define an "operation" as the computation a module performs to process a group of tokens, and we introduce an operation pruning framework that lets modules process tokens selectively. Built on this framework, we propose Depth-wise Operation Pruning (DOP), a data-driven method that searches for strategies to prune redundant operations, freeing computational budget so that critical modules can process more tokens than under uniform allocation. The search minimizes divergence from the original model's output probability distribution on a small validation set while satisfying computational constraints. For efficient optimization, DOP applies depth-wise pruning to reduce the policy space and uses an additive approximation to minimize the number of validation runs required. Depth-wise pruning partitions operations by module type and token group, and within each module-group pair prunes operations in deeper layers before those in shallower layers. The additive approximation measures individual divergences by varying each policy parameter independently, then sums them to approximate the joint divergence of changing all policy parameters simultaneously. This reduces the required validation runs from exponential to linear in the number of policy parameters. Comprehensive evaluations show that DOP establishes new state-of-the-art performance across 6 MLLMs and 13 benchmarks against 12 baselines. On LLaVA-Next-7B, DOP achieves an 86% TFLOPs reduction and an 83% latency reduction on a real GPU with only 1% performance loss. Our extensive ablation studies further demonstrate DOP's data and time efficiency as well as its strong generalization capabilities.
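
The additive approximation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `search_policy`, the toy divergence values, and the cost model `toy_cost` are hypothetical and are not taken from the paper. The sketch only shows the idea of scoring a joint policy by summing individually measured divergences and keeping the best policy that fits a compute budget.

```python
from itertools import product


def search_policy(div_a, div_p, div_v, cost, budget):
    """Pick (d_a, d_p, n_v) minimizing the summed individual divergences
    subject to cost(d_a, d_p, n_v) <= budget.

    div_a, div_p, div_v map each candidate value to a divergence that was
    measured by varying that one parameter alone (one validation run each).
    """
    best, best_score = None, float("inf")
    for d_a, d_p, n_v in product(div_a, div_p, div_v):
        if cost(d_a, d_p, n_v) > budget:
            continue  # policy exceeds the compute budget
        # Additive approximation: joint divergence ~ sum of individual ones.
        score = div_a[d_a] + div_p[d_p] + div_v[n_v]
        if score < best_score:
            best, best_score = (d_a, d_p, n_v), score
    return best, best_score


# Toy numbers, invented purely for illustration.
div_a = {8: 0.30, 16: 0.10, 32: 0.0}     # MHA depth -> divergence
div_p = {8: 0.50, 16: 0.20, 32: 0.0}     # MLP depth -> divergence
div_v = {64: 0.90, 192: 0.15, 576: 0.0}  # visual token count -> divergence


def toy_cost(d_a, d_p, n_v):
    # Hypothetical cost model: visual-token work scales with tokens x depth.
    return n_v * (d_a + 2 * d_p)


policy, score = search_policy(div_a, div_p, div_v, toy_cost, budget=10000)
runs_needed = len(div_a) + len(div_p) + len(div_v)  # individual measurements
joint_runs = len(div_a) * len(div_p) * len(div_v)   # exhaustive joint runs
```

With these toy numbers the search returns the policy `(16, 16, 192)`: pruning deeper MHA and MLP operations frees enough budget to keep 192 visual tokens instead of 64. Only 9 individual measurements are needed here, compared with 27 joint evaluations; the gap widens as more policy parameters or candidate values are added.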

Paper Structure

This paper contains 42 sections, 15 equations, 4 figures, 17 tables.

Figures (4)

  • Figure 1: Comparison of our DOP and token reduction: (a) Computational overhead in MLLM decoders primarily stems from MHA and MLP modules processing excessive tokens. (b) Token reduction [chen2024image; zhang2024vispruner; zhang2025cdpruner] has all modules process the same token load regardless of structural redundancy differences. (c) Our DOP prunes redundant operations and saves computational budget for critical modules to process more tokens than uniform allocation.
  • Figure 2: Overview of DOP. (a) DOP decomposes MLLM decoder computations into operations defined by $(\text{group}, \text{layer}, \text{module})$ as atomic computation carriers. Tokens are categorized into four groups based on input sources, and we prune operations for the visual token group in MHA and MLP modules. (b) DOP employs depth-wise pruning: visual tokens bypass MHA operations beyond layer $d_A$ and MLP operations beyond layer $d_P$, with the remaining operations processing $n_v$ selected visual tokens. (c) DOP reduces optimization cost through an additive approximation that replaces the joint divergence with the sum of individual divergences. Each individual divergence is measured by independently reducing one of $d_A$, $d_P$, $n_v$ and computing the KL divergence from the original model's output probability distribution. Evaluation points are sparsely sampled, with other points obtained through interpolation.
  • Figure 3: Overview of DOP performance across LLaVA-1.5-7B&13B [liu2024visual], LLaVA-Next-7B&13B [liu2024llavanext], Qwen2.5-VL-7B [Qwen2.5-VL], and InternVL3-8B [zhu2025internvl3]. Rel. Avg. is the mean performance ratio relative to the original model, averaged across the benchmarks used for each model. DOP is compatible with various token reduction methods and consistently boosts the performance of the corresponding baselines.
  • Figure 4: Individual divergence patterns across different benchmarks on LLaVA-1.5-7B. We visualize the three individual divergences $\hat{\mathcal{D}}_A(d_A)$, $\hat{\mathcal{D}}_P(d_P)$, and $\hat{\mathcal{D}}_v(n_v)$ with respect to MHA depth, MLP depth, and visual token count, respectively. Each subplot shows divergences evaluated on three benchmarks: TextVQA (blue), VQAv2 (green), and SeedBench (orange). Notably, SeedBench exhibits significantly higher redundancy in deeper-layer operations than the other two benchmarks.
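
The depth-wise pruning rule in the Figure 2 caption can be sketched as a mask over (group, layer, module) operations. This is an illustrative reading only: the function name `visual_operation_mask` is hypothetical, and the sketch assumes layers are numbered from 1 and that the boundary layers $d_A$ and $d_P$ themselves are kept, which the caption does not specify.

```python
def visual_operation_mask(num_layers, d_a, d_p):
    """Return {(group, layer, module): keep} for the visual token group.

    Assumptions (not confirmed by the source): layers are numbered
    1..num_layers, and "beyond layer d" excludes layer d itself, so the
    operation at the boundary layer is kept.
    """
    mask = {}
    for layer in range(1, num_layers + 1):
        # Visual tokens bypass MHA after layer d_a and MLP after layer d_p.
        mask[("visual", layer, "MHA")] = layer <= d_a
        mask[("visual", layer, "MLP")] = layer <= d_p
    return mask


# Toy 4-layer decoder with MHA depth 3 and MLP depth 2.
mask = visual_operation_mask(num_layers=4, d_a=3, d_p=2)
kept = sorted(op for op, keep in mask.items() if keep)
```

In this toy configuration five of the eight visual-token operations are kept: MHA in layers 1 to 3 and MLP in layers 1 and 2. The kept operations would then process only the $n_v$ selected visual tokens, which this sketch does not model.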