VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

Xinye Cao, Hongcan Guo, Jiawen Qian, Guoshun Nan, Chao Wang, Yuqi Pan, Tianhao Hou, Xiaojuan Wang, Yutong Gao

TL;DR

VideoMiner tackles long-video grounding by building an adaptive hierarchical tree through iterative segmentation, captioning, and clustering, then grounding key frames via a tree-aware reinforcement learner. It introduces T-GRPO, which assigns node- and tree-level rewards and employs a tree-growth auxin mechanism to balance accuracy and efficiency. Across EgoSchema, MLVU, Video-MME, and LongVideoBench, VideoMiner achieves state-of-the-art results on long-video understanding and maintains strong performance on shorter videos, while encouraging chain-of-thought reasoning in the policy. This approach advances practical long-video QA by preserving temporal structure, reducing redundancy, and providing interpretable grounding and reasoning.

Abstract

Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, in end-to-end video understanding, uniformly sampling frames overwhelms the LLM with irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. VideoMiner progresses from long videos to events to frames while preserving temporal coherence, which addresses the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization method for reinforcement learning that guides VideoMiner's exploration of the tree. T-GRPO is specifically designed for tree structures: it integrates spatiotemporal information at the event level while being guided by the question, which addresses the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Surprisingly, T-GRPO incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree growth auxin dynamically adjusts the expansion depth, yielding gains in both accuracy and efficiency. The code is publicly available at https://github.com/caoxinye/VideoMiner.
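
The segment, caption, and cluster loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Node`, `segment`, `cluster_adjacent`, and `build_tree`, the equal-length split, the merge-on-identical-caption rule, and the `expand_fn` callback (standing in for the policy model's decision to explore a node) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    start: float                       # span start time in seconds
    end: float                         # span end time in seconds
    caption: str = ""                  # text summary of the span
    children: List["Node"] = field(default_factory=list)


def segment(start: float, end: float, k: int) -> List[Tuple[float, float]]:
    """Split [start, end) into k equal spans, kept in temporal order."""
    step = (end - start) / k
    return [(start + i * step, start + (i + 1) * step) for i in range(k)]


def cluster_adjacent(spans, captions) -> List[Node]:
    """Merge neighbouring spans with identical captions (hypothetical rule).

    Only adjacent spans are merged, so temporal order is preserved.
    """
    merged: List[Node] = []
    for (s, e), cap in zip(spans, captions):
        if merged and merged[-1].caption == cap:
            merged[-1].end = e         # extend the previous event
        else:
            merged.append(Node(s, e, cap))
    return merged


def build_tree(
    node: Node,
    caption_fn: Callable[[float, float], str],   # stand-in for a captioning model
    expand_fn: Callable[[Node, int], bool],      # stand-in for the policy decision
    k: int = 4,
    depth: int = 0,
    max_depth: int = 3,
) -> Node:
    """Recursively segment, caption, and cluster a span into a tree."""
    if depth >= max_depth or not expand_fn(node, depth):
        return node
    spans = segment(node.start, node.end, k)
    captions = [caption_fn(s, e) for s, e in spans]
    node.children = cluster_adjacent(spans, captions)
    for child in node.children:
        build_tree(child, caption_fn, expand_fn, k, depth + 1, max_depth)
    return node


if __name__ == "__main__":
    # Toy captioner: the first 15 minutes are "anthem", the rest is "match".
    toy_caption = lambda s, e: "anthem" if s < 900 else "match"
    root = build_tree(Node(0.0, 3600.0), toy_caption, lambda n, d: True, max_depth=1)
    print([(c.start, c.end, c.caption) for c in root.children])
```

In this sketch the depth cap `max_depth` is a fixed constant; the paper's tree growth auxin adjusts expansion depth dynamically, and its mechanism is not described in this section.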

Paper Structure

This paper contains 32 sections, 15 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Illustration of spatiotemporal Q&A performance on long videos. The input is a 51-minute video with a question about athletes' actions before the match. Baselines provide answers such as changing clothes or dancing. Our method, which incentivizes chain-of-thought ability through reinforcement learning, correctly identifies the act of singing the national anthem by locating key frames. The right side of the figure shows our superior performance against multiple baselines across both short and long videos.
  • Figure 2: Illustration of the workflow of our proposed VideoMiner. The long video undergoes iterative segmentation, captioning, and clustering to construct a hierarchical tree structure. The policy model governs the exploration of tree nodes and identifies key frames. The selected key frames, along with the original question, are then fed into the VLM for long-video reasoning, producing the final answer.
  • Figure 3: Illustration of the proposed T-GRPO. To highlight the differences from GRPO, we visualize the original GRPO components in gray, while newly introduced components are marked in red. Unlike GRPO, which primarily optimizes the final output, our approach focuses on the tree generation process, including node exploration behavior. To adapt to the hierarchical structure and video understanding tasks, we modify the tree framework and redesign the reward function accordingly.
  • Figure 4: Ablation study of clustering and reinforcement learning methods. (a) evaluates the impact of different clustering methods on accuracy and efficiency, while (b) analyzes the effect of various reinforcement learning approaches on accuracy.
  • Figure 5: Case study of the proposed VideoMiner. We present the tree node exploration path and the detailed reasoning process. Our proposed T-GRPO incentivizes chain-of-thought reasoning in the policy model, boosting the reasoning ability of LLMs.
  • ...and 1 more figure
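
The Figure 3 caption contrasts T-GRPO with standard GRPO. As background, the group-relative advantage used by standard GRPO can be sketched as below: each sampled output's reward is normalized by the mean and standard deviation of its group. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are illustrative assumptions. How T-GRPO combines node-level and tree-level rewards is not specified in this section, so the sketch does not attempt it.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its sampling group (standard GRPO style).

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    The population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled outputs for one question: two rewarded, two not.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because advantages are computed relative to the group, a group whose outputs all receive the same reward yields zero advantage for every member and so contributes no learning signal.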