LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Jihao Qiu; Lingxi Xie; Xinyue Huo; Qi Tian; Qixiang Ye

TL;DR

The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation.

Abstract

This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent starts its traversal from top-level visual summaries and iteratively refines its focus, halting exploration as soon as it has acquired sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool (CoTwT) trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and are publicly available at: https://github.com/qiujihao19/LongVideo-R1
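The coarse-to-fine traversal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the actual agent reasons with an MLLM and tool calls, whereas this toy replaces both decisions (whether the evidence is sufficient, and which clip to visit next) with a simple keyword-overlap score. The names `Clip`, `overlap`, `navigate`, and the `enough` threshold are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Clip:
    """Hypothetical node of a caption hierarchy: a time span plus its caption."""
    start: float
    end: float
    caption: str
    children: List["Clip"] = field(default_factory=list)


def overlap(query: str, caption: str) -> int:
    """Toy relevance score (assumption): number of query words found in the caption."""
    words = set(caption.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)


def navigate(root: Clip, query: str, enough: int, max_steps: int = 10) -> Tuple[Clip, int]:
    """Descend from the top-level summary and stop once the evidence suffices."""
    node, steps = root, 0
    while steps < max_steps:
        # Decision 1: is the information collected so far sufficient to answer?
        if overlap(query, node.caption) >= enough or not node.children:
            break
        # Decision 2: move to the child clip most likely to hold useful information.
        node = max(node.children, key=lambda c: overlap(query, c.caption))
        steps += 1
    return node, steps


# A tiny, made-up three-level hierarchy for a one-hour video.
root = Clip(0, 3600, "a cooking show with several guests", [
    Clip(0, 1800, "host introduces guests and kitchen"),
    Clip(1800, 3600, "guests cook pasta and dessert", [
        Clip(1800, 2700, "a guest adds red sauce to the pasta"),
        Clip(2700, 3600, "a guest decorates the dessert"),
    ]),
])

clip, steps = navigate(root, "what sauce goes on the pasta", enough=3)
print(clip.start, clip.end, steps)
```

In this toy, a query that the top-level summary already covers stops at the root without visiting any clip, while a detail-oriented query descends only along the single most promising branch. This mirrors, in a very simplified way, the early-halting behavior the abstract attributes to the agent.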

Paper Structure (30 sections, 11 equations, 9 figures, 10 tables, 1 algorithm)

Introduction
Related Work
On Efficient Long Video Understanding
What Makes Efficient Video Understanding?
LongVideo-R1 Framework
Chain-of-Thought-with-Tool Procedure
Data Curation
Data Preparation
Generating CoTwT Trajectories
Training LongVideo-R1 Agent
Supervised Fine-tuning
Reinforcement Learning with GRPO
Reward Design
Rollout and Optimization
Experiments
...and 15 more sections

Figures (9)

Figure 1: Motivation and performance comparison. Left: For efficient understanding of long videos, the algorithm should learn to fetch and perceive information effectively, where the core abilities are: (1) judging whether the collected information is sufficient for answering, and (2) if not, navigating to the next clip that is most likely to contain useful information. Right: LongVideo-R1 achieves a better tradeoff between QA accuracy and efficiency than recent methods on the LVBench dataset [wang2025lvbench]. The marker size indicates model scale.
Figure 2: An illustration of generating CoTwT trajectories from clue-grounded video QA data.
Figure 3: LongVideo-R1 can navigate in ultra-long videos efficiently. We show an example in a long-form TV drama, A Lifelong Journey.
Figure 4: An example of how LongVideo-R1 smartly navigates to the critical segment and answers the question.
Figure 5: More examples on ultra-long videos.
...and 4 more figures
