HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models

Yansong Guo, Chaoyang Zhu, Jiayi Ji, Jianghang Lin, Liujuan Cao

Abstract

Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden at deployment. Existing methods mainly prune video tokens at the input level while neglecting the information structure inherent in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Motivated by two observations, namely that videos possess a natural segment-frame structure and that LLMs propagate multi-modal information unidirectionally across their layers, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; and 3) layer-level, where the retained token budget gradually shrinks at deeper LLM layers without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
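As a concrete illustration of the segment-level step, the following is a minimal sketch of merge-ratio-guided temporal segmentation. The cosine-similarity test, tensor shapes, and both thresholds (merge_thresh, seg_thresh) are illustrative assumptions rather than the paper's released implementation.

```python
# Hedged sketch: merge-ratio-guided temporal segmentation.
# Assumptions (not from the paper): cosine similarity between
# corresponding patch tokens, merge_thresh=0.9, seg_thresh=0.5.
import torch
import torch.nn.functional as F

def segment_by_merge_ratio(frames: torch.Tensor,
                           merge_thresh: float = 0.9,
                           seg_thresh: float = 0.5):
    """frames: (T, N, D) patch tokens for T frames of N tokens each.
    Returns a list of (start, end) frame-index pairs, one per segment."""
    T = frames.shape[0]
    boundaries = [0]
    for t in range(1, T):
        # Similarity between corresponding patch positions of adjacent frames.
        sim = F.cosine_similarity(frames[t], frames[t - 1], dim=-1)  # (N,)
        # Fraction of tokens that would be merged into their first occurrence.
        merge_ratio = (sim > merge_thresh).float().mean().item()
        if merge_ratio < seg_thresh:  # little overlap -> open a new segment
            boundaries.append(t)
    boundaries.append(T)
    return list(zip(boundaries[:-1], boundaries[1:]))

# Toy usage: 8 frames, 16 tokens per frame, 4-dim features.
print(segment_by_merge_ratio(torch.randn(8, 16, 4)))
```

With random features the cosine similarities are near zero, so almost every frame opens its own segment; on real video features, long static shots collapse into a single segment.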

Figures (4)

  • Figure 1: Comparison of existing VideoLLM pruning methods and our approach under different pruning ratios on LLaVA-Video (top) and LLaVA-OneVision (bottom). "mc" and "oe" denote the multiple-choice and open-ended settings of NExT-QA; "w/o sub." and "w/ sub." are the VideoMME settings without and with subtitles. HieraVid is superior across all benchmarks, which span diverse durations, complexities, and pruning ratios.
  • Figure 2: Framework of HieraVid. To balance the visual information loss caused by shallow-layer pruning against the inefficiency of deep-layer pruning, we apply layer-level multi-stage pruning to the LLM. HieraVid divides the pruning process into three stages. (i) Merge Ratio-guided Segmentation: at the LLM input layer, static tokens at corresponding positions across frames are temporally combined by merging similar tokens into their first occurrence. Frames are then partitioned into multiple segments based on the merge ratio between adjacent frames, ensuring inter-segment diversity and intra-segment continuity (see Figure 3). (ii) Segment Budget Allocation dynamically assigns pruning ratios to each segment to maximize visual diversity after pruning. (iii) Frame-level DPP Pruning integrates the DPP kernel matrix with instruction features to balance visual diversity and instruction relevance in the pruned tokens (see the sketch after this list).
  • Figure 3: Visualization of frame segment results by our HieraVid. The horizontal dashed line represents the segment threshold. Lower bars indicate lower similarity with preceding frames. Black borders mark the segmentation boundaries, where frames within each segment exhibit high similarity, while those across segments maintain distinctiveness.
  • Figure 4: Ablation experiments on the segment threshold $\beta$.
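To make steps (ii) and (iii) of Figure 2 concrete, the sketch below implements frame-level token pruning with a relevance-weighted DPP: the kernel L = diag(q) S diag(q) couples pairwise visual similarity S with instruction-relevance scores q, and tokens are selected by greedy log-determinant (MAP) maximization. This kernel form, the alpha temperature, and the greedy loop are standard DPP machinery assumed here for illustration; HieraVid's exact kernel and segment budget allocation may differ.

```python
# Hedged sketch: instruction-aware DPP pruning of visual tokens.
# The kernel construction and greedy MAP loop are assumptions;
# they illustrate the idea, not the paper's exact algorithm.
import torch
import torch.nn.functional as F

def dpp_prune(tokens: torch.Tensor, instr: torch.Tensor,
              k: int, alpha: float = 1.0) -> torch.Tensor:
    """tokens: (N, D) visual tokens; instr: (D,) pooled instruction feature.
    Returns the indices of the k tokens kept after pruning."""
    feats = F.normalize(tokens, dim=-1)
    sim = feats @ feats.T                                 # (N, N) diversity term
    # Instruction-relevance scores reweight the kernel: L = diag(q) S diag(q).
    q = torch.exp(alpha * F.normalize(instr, dim=-1) @ feats.T)  # (N,)
    L = q[:, None] * sim * q[None, :]
    selected, remaining = [], list(range(tokens.shape[0]))
    for _ in range(k):
        best, best_gain = remaining[0], -float("inf")
        for i in remaining:
            idx = selected + [i]
            sub = L[idx][:, idx] + 1e-6 * torch.eye(len(idx))  # jitter for stability
            gain = torch.logdet(sub).item()   # log-volume of the candidate set
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
    return torch.tensor(sorted(selected))     # keep original token order

# Toy usage: keep 10 of 32 tokens given an 8-dim instruction feature.
print(dpp_prune(torch.randn(32, 8), torch.randn(8), k=10))
```

Raising alpha biases selection toward instruction-relevant tokens, while alpha = 0 reduces the kernel to pure visual diversity; applying a per-segment k would realize the dynamic budget allocation of step (ii).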