TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She

TL;DR

TimeSearch-R tackles long-form video understanding by making temporal search an active, end-to-end learned component. It reformulates search as interleaved text-video thinking and introduces GRPO-CSV to supervise intermediate search steps, improving both exploration completeness and reasoning consistency. A two-stage dataset construction enables robust RL training. Experiments show state-of-the-art results on the Haystack benchmarks and strong gains on VideoMME and MLVU, along with a new state of the art on LongVideoBench. The approach improves interpretability through explicit thinking traces and proposes a scalable training paradigm that derives process supervision from outcome rewards for complex multimodal reasoning. Overall, TimeSearch-R demonstrates that end-to-end temporal search guided by reinforced, self-verified reasoning substantially improves both efficiency and accuracy in long-form video understanding.

Abstract

Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space, but they typically rely on a hand-crafted search process and lack end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, integrating the search for video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods such as Group Relative Policy Optimization (GRPO) to video reasoning leaves intermediate search decisions unsupervised, which leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers the searched video frames from the interleaved reasoning process and uses the same policy model to verify that they are adequate, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to increase task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as on long-form video understanding benchmarks such as VideoMME and MLVU. Notably, TimeSearch-R establishes a new state of the art on LongVideoBench, with a 4.1% improvement over the base model Qwen2.5-VL and a 2.0% improvement over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
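
The reward idea behind GRPO-CSV can be illustrated with a small sketch. This is not the authors' implementation: it assumes each rollout produces two boolean signals (whether the final answer is correct, and whether the same policy re-answers correctly from the searched frames alone), and it combines them additively with a hypothetical weight `csv_weight`. The group-relative normalization follows the usual GRPO form, here with the population standard deviation; implementations differ on that detail.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    answer_correct: bool    # outcome signal: final answer matches the label
    reanswer_correct: bool  # CSV signal: re-answer from searched frames only is correct


def rollout_reward(r: Rollout, csv_weight: float = 0.5) -> float:
    """Hypothetical reward: outcome term plus a weighted completeness term."""
    return float(r.answer_correct) + csv_weight * float(r.reanswer_correct)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four rollouts sampled for the same question (made-up values).
group = [
    Rollout(answer_correct=True, reanswer_correct=True),
    Rollout(answer_correct=True, reanswer_correct=False),
    Rollout(answer_correct=False, reanswer_correct=False),
    Rollout(answer_correct=False, reanswer_correct=False),
]
rewards = [rollout_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the rollout whose searched frames suffice for re-answering receives a higher advantage than the one that answered correctly without gathering sufficient evidence, which is the kind of pressure toward complete exploration that the abstract describes.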

Paper Structure

This paper contains 58 sections, 7 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: (a) Different paradigms of temporal search. Previous works such as VideoAgent [wang2024videoagent] and T* [ye2025rethinking] predominantly rely on handcrafted workflows, resulting in suboptimal strategies. Our approach adopts end-to-end reinforcement learning, enabling the model to learn optimal search strategies directly from data. (b) Interleaved text-video thinking process. We reformulate the temporal search task as an interleaved text-video thinking process, in which temporal search is woven into the reasoning process.
  • Figure 2: Two failure modes with the original GRPO reward. Left: Insufficient temporal exploration. The model misses critical frames required to correctly answer the question. Right: Inconsistent logical reasoning. The intermediate reasoning process contradicts the final answer.
  • Figure 3: Overall pipeline of GRPO-CSV. Building upon the original GRPO, CSV extracts a dynamic frame set from the multi-modal CoT and constructs a vision-only CoT for re-answering. This design verifies that the searched dynamic frames provide sufficient evidence for correct reasoning, ensuring completeness and consistency without requiring explicit frame-level supervision.
  • Figure 4: Ablation Results
  • Figure 5: Training Dynamics
  • ...and 13 more figures
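
The frame-extraction step described in the Figure 3 caption might be sketched as follows. The trace representation (a list of text and search steps), the `SearchStep` fields, and the fixed sampling rate `fps` are all assumptions made for illustration; this section does not show the paper's actual data structures.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class TextStep:
    thought: str


@dataclass
class SearchStep:
    start_s: float  # clip start in seconds (hypothetical field)
    end_s: float    # clip end in seconds (hypothetical field)


Step = Union[TextStep, SearchStep]


def dynamic_frame_set(trace: list[Step], fps: float = 1.0) -> list[float]:
    """Collect timestamps of every searched frame, deduplicated and sorted."""
    stamps: set[float] = set()
    for step in trace:
        if isinstance(step, SearchStep):
            # Number of frames sampled at `fps` inside the inclusive clip range.
            n_frames = int((step.end_s - step.start_s) * fps) + 1
            for i in range(n_frames):
                stamps.add(round(step.start_s + i / fps, 3))
    return sorted(stamps)


# A made-up interleaved trace with two overlapping searches.
trace: list[Step] = [
    TextStep("The question asks what happens after the door opens."),
    SearchStep(start_s=10.0, end_s=12.0),
    TextStep("The door opens near 12s; look slightly later."),
    SearchStep(start_s=11.0, end_s=14.0),
]
frames = dynamic_frame_set(trace)
```

Under these assumptions, the vision-only input for re-answering would consist of the question plus the frames at these timestamps, with the text thoughts dropped, so a correct re-answer indicates that the searched frames alone carry enough evidence.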