MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang

TL;DR

MMG-Vid addresses the high cost of video token processing in VLLMs with a training-free pruning framework that maximizes marginal gains at both the segment level and the token level. It combines similarity-based frame segmentation, dynamic per-segment budgeting, and temporal-guided density peak clustering (TG-DPC) to prune tokens in a temporally coherent manner. Across multiple benchmarks and LLaVA-based systems, MMG-Vid maintains near-original accuracy; on LLaVA-OneVision-7B it retains over 99.5% of the original performance while pruning 75% of visual tokens and accelerating the prefilling stage by 3.9x. This balance of accuracy and efficiency makes VLLMs more practical to deploy in real-world settings.

Abstract

Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency through visual token pruning. However, because they treat video understanding as a multi-frame task, they overlook the dynamic characteristics and temporal dependencies of video frames. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both the segment level and the token level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget across segments to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid makes the most of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid maintains over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
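To make the first step of the abstract concrete, the following is a minimal sketch of similarity-based frame segmentation. It is not the authors' implementation: the function names (`cosine`, `segment_frames`), the use of cosine similarity between consecutive frame features, and the `threshold` value are all assumptions chosen for illustration, since the text above only states that segments are formed "based on frame similarity".

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def segment_frames(
    frame_feats: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[List[int]]:
    """Group consecutive frame indices into segments.

    Assumed rule (hypothetical, for illustration only): a new segment starts
    whenever the similarity between a frame and its predecessor drops
    below `threshold`.
    """
    if not frame_feats:
        return []
    segments: List[List[int]] = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < threshold:
            segments.append([i])      # scene change: open a new segment
        else:
            segments[-1].append(i)    # similar to previous frame: extend segment
    return segments


# Toy per-frame features: two near-static frames, a scene change,
# two more near-static frames, then another change.
feats = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [1.0, 0.1]]
segments = segment_frames(feats, threshold=0.9)
print(segments)  # -> [[0, 1], [2, 3], [4]]
```

In practice the per-frame features would come from the vision encoder (for example, pooled patch embeddings), and the threshold would need tuning per model; both choices are outside what the abstract specifies.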

Paper Structure

This paper contains 20 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison of different budget allocation strategies. (a) Static uniform budgeting overlooks varying segment significance. (b) Static importance-based budgeting acknowledges the significance of segments but wastes resources by allocating redundant budgets to two visually similar static segments (the first and third). (c) Our marginal gain-based segment budgeting further reduces redundancy by penalizing a segment's budget if its information is already included in previously selected segments. This results in an optimal allocation of the budget that takes into account the dynamic characteristics.
  • Figure 2: Overall framework. Segment-level (Bottom-Right): We iteratively calculate the marginal gain (a combination of representativeness and diversity) for each segment to dynamically allocate the budget, prioritizing more informative segments. Token-level (Top-Right): Our proposed TG-DPC progressively prunes each frame by selecting tokens that are both salient within the frame and novel across the temporal dimension, guided by the set of previously selected tokens.
  • Figure 3: Ablation study of MMG-Vid's modules on LLaVA-Video (Retention Ratio: 25%). "DPC-KNN" refers to using the standard DPC-KNN algorithm instead of our proposed TG-DPC. "Uniform Budget" refers to the conventional method of assigning a fixed budget to each frame.
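
The marginal gain-based budgeting described in the captions of Figures 1 and 2 can be illustrated with a small greedy sketch. Everything specific here is an assumption rather than the paper's formulation: the gain is modeled as `importance * (1 - redundancy)`, where redundancy is the highest cosine similarity to any previously selected segment, and the budget is split in proportion to the resulting gains. The function name `marginal_gain_budgets` and the toy inputs are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def marginal_gain_budgets(
    seg_feats: Sequence[Sequence[float]],
    importance: Sequence[float],
    total_budget: int,
) -> List[int]:
    """Split `total_budget` tokens across segments by greedy marginal gain.

    Assumed gain (illustrative only): importance[i] * (1 - redundancy), where
    redundancy is the max similarity to segments picked in earlier iterations.
    `importance` is assumed to be non-negative.
    """
    n = len(seg_feats)
    if n == 0:
        return []
    selected: List[int] = []
    gains = [0.0] * n
    remaining = list(range(n))
    while remaining:
        best, best_gain = remaining[0], -1.0
        for i in remaining:
            redundancy = max(
                (cosine(seg_feats[i], seg_feats[j]) for j in selected), default=0.0
            )
            redundancy = min(1.0, max(0.0, redundancy))  # clamp to [0, 1]
            gain = importance[i] * (1.0 - redundancy)
            if gain > best_gain:
                best, best_gain = i, gain
        gains[best] = best_gain       # gain at the moment the segment is picked
        selected.append(best)
        remaining.remove(best)
    total = sum(gains)
    if total <= 0.0:                  # degenerate case: fall back to uniform
        gains, total = [1.0] * n, float(n)
    budgets = [int(total_budget * g / total) for g in gains]
    # Hand tokens lost to flooring to the highest-gain segments.
    leftover = total_budget - sum(budgets)
    for i in sorted(range(n), key=lambda k: -gains[k])[:leftover]:
        budgets[i] += 1
    return budgets


# Toy setup mirroring Figure 1: segments 0 and 2 look alike, segment 1 differs.
seg_feats = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
importance = [0.5, 0.5, 0.5]
budgets = marginal_gain_budgets(seg_feats, importance, total_budget=100)
print(budgets)
```

With equal importance scores, the third segment receives a much smaller share because most of its information is already covered by the first, which is the behavior Figure 1(c) contrasts with static importance-based budgeting.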