ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

Shaobo Ju, Baiyang Song, Tao Chen, Jiapeng Zhang, Qiong Wu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji

Abstract

Because it substantially reduces computation and memory overhead, token compression has become an active research topic for multimodal large language models (MLLMs) and has achieved remarkable progress on image-language tasks. For video, however, existing methods still fall short of high-ratio token compression. We attribute this shortcoming to insufficient modeling of the temporal and continuous nature of video content, and propose a novel, training-free token pruning method for video MLLMs, termed ForestPrune, which achieves effective high-ratio pruning via Spatial-Temporal Forest Modeling. In practice, ForestPrune constructs token forests across video frames under semantic, spatial and temporal constraints, yielding a holistic view of the video. ForestPrune then evaluates the importance of token trees and nodes based on tree depth and node roles, thereby obtaining a globally optimal pruning decision. To validate ForestPrune, we apply it to two representative video MLLMs, LLaVA-Video and LLaVA-OneVision, and conduct extensive experiments on a range of video benchmarks. The results demonstrate its effectiveness for video MLLMs, e.g., retaining 95.8% average accuracy while removing 90% of the tokens for LLaVA-OneVision, as well as better accuracy and efficiency than the compared token compression methods, e.g., +10.1% accuracy on MLVU and 81.4% less pruning time than FrameFusion on LLaVA-Video.

Paper Structure (15 sections, 11 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: Comparison between the proposed ForestPrune (Ours) and existing compression methods [jiang2025kind, fastv, yang2025visionzip, fu2024framefusion] on VideoMME. The base model is LLaVA-Video-7B. As the pruning ratio increases, existing methods suffer obvious performance drops, while ForestPrune remains robust.
  • Figure 2: Visualization of token pruning by G-Prune [jiang2025kind] and our ForestPrune. The image-centric G-Prune retains the important tokens of each frame well, but leaves obvious redundancy across frames. In contrast, ForestPrune obtains a globally optimal pruning via spatial-temporal forest modeling.
  • Figure 3: Illustration of the proposed ForestPrune. Input video frames are first encoded by the visual encoder, after which ForestPrune selects a set of tokens from each frame as candidate nodes. ForestPrune then constructs token trees based on the semantic similarity $\tau_s$, the spatial distance $\tau_p$ and the temporal order of the frames, thereby forming the spatial-temporal forest (a). If too many trees (root nodes) are obtained, they are merged before pruning (b). The trees are then sorted in descending order of depth (c), and their leaf and tail nodes are progressively pruned until the compression budget is met (d). Through this spatial-temporal modeling, ForestPrune can estimate frame-wise redundancy well and obtain a globally optimal pruning decision.
  • Figure 4: Efficiency comparison between ForestPrune and three existing methods, using LLaVA-Video at a 90% compression ratio. GPU memory is reported as peak usage.
  • Figure 5: Visualized results of ForestPrune. Subfigure (a) shows the spatial-temporal trees built by ForestPrune and the resulting compression, illustrating its global spatial-temporal modeling capability. Subfigure (b) compares the compression results of ForestPrune, G-Prune, and VisionZip, showing that ForestPrune reduces temporal redundancy better than image compression methods. The ORANGE and BLUE trees are the spatial-temporal trees. Frames where a scene change occurs are marked with GREEN boxes.
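
The pipeline in the Figure 3 caption can be illustrated with a toy sketch. This is not the authors' implementation; everything beyond the caption is an assumption made for illustration. The assumed parts are: a token is linked to a node in the previous frame when their cosine similarity is at least `tau_s` and their grid distance is at most `tau_p`, roots are never pruned, and only the latest leaf of the deepest tree is removed at each step. The tree-merging step (b) and the paper's notion of "tail" nodes are omitted. The names `Node`, `build_forest`, `depth`, `collect`, and `prune` are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality, so list.remove() is safe
class Node:
    frame: int
    pos: tuple
    feat: list
    parent: "Node | None" = None
    children: list = field(default_factory=list)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_forest(frames, tau_s=0.9, tau_p=1.5):
    """Link each token to its most similar nearby token in the previous frame.

    Assumed linking rule: cosine >= tau_s and Euclidean grid distance <= tau_p.
    Tokens with no valid parent start a new tree (become roots).
    """
    roots, prev = [], []
    for t, frame in enumerate(frames):
        cur = []
        for pos, feat in frame:
            node = Node(t, pos, feat)
            cands = [p for p in prev
                     if math.dist(p.pos, pos) <= tau_p
                     and cosine(p.feat, feat) >= tau_s]
            if cands:
                node.parent = max(cands, key=lambda p: cosine(p.feat, feat))
                node.parent.children.append(node)
            else:
                roots.append(node)
            cur.append(node)
        prev = cur
    return roots

def depth(node):
    return 1 + max((depth(c) for c in node.children), default=0)

def collect(node):
    out = [node]
    for child in node.children:
        out.extend(collect(child))
    return out

def prune(roots, budget):
    """Drop the latest leaf of the deepest tree until `budget` tokens remain."""
    kept = sum(len(collect(r)) for r in roots)
    while kept > budget:
        tree = max(roots, key=depth)
        if depth(tree) == 1:
            break  # only roots remain; this sketch never removes roots
        leaf = max((n for n in collect(tree) if not n.children),
                   key=lambda n: n.frame)
        leaf.parent.children.remove(leaf)
        kept -= 1
    return [n for r in roots for n in collect(r)]

# Toy video: 3 frames, 2 tokens each; the second token changes in frame 2.
frames = [
    [((0, 0), [1.0, 0.0]), ((0, 1), [0.0, 1.0])],
    [((0, 0), [1.0, 0.0]), ((0, 1), [0.0, 1.0])],
    [((0, 0), [1.0, 0.0]), ((0, 1), [-1.0, 0.0])],
]
roots = build_forest(frames)
tree_depths = sorted(depth(r) for r in roots)
kept = prune(roots, budget=4)
kept_ids = [(n.frame, n.pos) for n in kept]
```

In this toy input the static token forms one deep tree, so its later copies are the first to be removed, while the token that changes in the last frame starts its own tree and survives. That mirrors the intuition in the caption that deep trees indicate temporal redundancy, under the simplified rules assumed here.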