VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations

Lu Dong, Haiyu Zhang, Han Lin, Ziang Yan, Xiangyu Zeng, Hongjie Zhang, Yifei Huang, Yi Wang, Zhen-Hua Ling, Limin Wang, Yali Wang

TL;DR

This work tackles video temporal grounding (VTG) by addressing both data quality and training difficulty. VideoTG-R1 is a curriculum RL framework that combines a Boundary Reflection Agent (BRA), which filters partially annotated samples, with a Difficulty Estimation Agent (DEA) and a progressive masking strategy. By filtering ambiguous data and presenting hard samples in an easy-to-hard order, VideoTG-R1 trains data-efficiently and achieves state-of-the-art results on VTG and grounded VideoQA benchmarks, using only 10% of the training data and 21% of the computational budget. The key contributions are BRA-based filtering of partially annotated samples, a zero-shot IoU-based difficulty metric, and a dynamic curriculum that modulates the video input, all implemented within a GRPO-based reinforcement learning setup. The approach is practically relevant for VTG on real-world, partially annotated datasets, and it points to further work on data quality assessment and semantically guided masking.
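
The zero-shot IoU-based difficulty metric can be illustrated with a minimal sketch. Temporal IoU itself is standard; the function names and the 0.3 threshold below are assumptions chosen for illustration, not values taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def split_by_difficulty(
    zero_shot_preds: List[Interval],
    annotations: List[Interval],
    threshold: float = 0.3,  # hypothetical cut-off, not from the paper
) -> Tuple[List[int], List[int]]:
    """Return (easy_indices, hard_indices) based on zero-shot IoU."""
    easy, hard = [], []
    for i, (pred, gt) in enumerate(zip(zero_shot_preds, annotations)):
        (easy if temporal_iou(pred, gt) >= threshold else hard).append(i)
    return easy, hard
```

For example, a zero-shot prediction of (0, 10) against an annotation of (5, 15) overlaps for 5 s out of a 15 s union, giving an IoU of 1/3.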

Abstract

Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL), they overlook the challenges arising from both the quality and difficulty of training samples. (1) Partially annotated samples. Many samples contain relevant segments beyond the annotated interval, introducing ambiguous supervision. (2) Hard-to-ground samples. Samples with poor zero-shot performance produce consistently low and indistinguishable rewards during RL training, exhibiting no clear preference among multiple outputs and thus hindering learning efficiency. To address these challenges, we propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training. Specifically, we propose a Boundary Reflection Agent that utilizes MLLMs to predict query-relevant timestamps outside the annotated intervals, allowing us to identify and filter out partially annotated samples, thereby reducing ambiguity. Furthermore, we introduce a Difficulty Estimation Agent to assess the training difficulty of each sample and design a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training steps, easing the training difficulty and providing clearer preference. Experiments on the VTG and grounded VideoQA tasks demonstrate the effectiveness of our method. Remarkably, with only 10% of the training samples and 21% of the computational budget, VideoTG-R1 outperforms full-data counterparts under both group relative policy optimization (GRPO) and supervised fine-tuning (SFT). The code is available at https://github.com/ldong1111/VideoTG-R1.
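
To see why consistently low rewards give "no clear preference", consider the group-relative advantage that GRPO computes over the outputs sampled for one query. The sketch below uses the common mean/standard-deviation normalization; treating the reward as a temporal IoU, and the `eps` constant, are assumptions made here for illustration.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    return [(r - mean) / (std + eps) for r in rewards]


# Hard-to-ground sample: every sampled output misses, so all rewards tie.
hard_group = [0.0, 0.0, 0.0, 0.0]
# Easier sample: outputs differ in quality, so a preference emerges.
easy_group = [0.1, 0.5, 0.9, 0.5]

hard_adv = group_relative_advantages(hard_group)
easy_adv = group_relative_advantages(easy_group)
```

When all rewards in a group tie, every advantage is zero and the sample contributes no learning signal; a group with spread-out rewards yields positive advantages for the better outputs and negative ones for the worse.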

Paper Structure

This paper contains 27 sections, 7 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Overview. VideoTG-R1 is a multi-agent system for efficient RL training. It contains three key modules that address the primary challenges in VTG. The Boundary Reflection Agent quantifies the missing annotations and identifies the partially annotated samples. The Difficulty Estimation Agent estimates the difficulty of each sample via zero-shot evaluation. The curriculum RL strategy learns the hard samples in an easy-to-hard manner by dynamically masking the videos.
  • Figure 2: Boundary Reflection Agent. First, the annotated segments are removed from the original video. Then, a boundary reflection prompt augmented with timestamp metadata and a grounding instruction is fed to an MLLM to estimate the total duration of query-relevant segments outside the annotated video. Finally, we can identify and discard partially annotated samples.
  • Figure 3: Left: Difficulty Estimation Agent. We perform zero-shot evaluation with an MLLM to estimate the difficulty of each sample, and then split the dataset into "hard" and "easy" samples based on their predicted IoUs. Right: Curriculum RL strategy. For hard-to-ground samples, we dynamically mask the segment outside the annotated video according to the training step, which eases training difficulty and provides clearer preference during the RL training process.
  • Figure 4: Out-of-domain evaluation across varying sample ratios. Models are trained with GRPO on the Charades-STA dataset, and tested on the ActivityNet-Captions dataset. Results are presented for a) R@0.5 and b) mIoU metrics.
  • Figure 5: a) Proportion of Partially Annotated Samples (PAS) in the manually labeled subset. We randomly select 100 samples from each of three datasets and annotate them. b) F1-score of PAS prediction using random selection and BRA. GT labels are drawn from the manually labeled subset.
  • ...and 6 more figures
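
The filtering and masking steps described in the captions of Figures 2 and 3 can be sketched as follows. This is a rough illustration under stated assumptions: the ratio-based filtering criterion, the linear unmasking schedule, and all names (`Sample`, `is_partially_annotated`, `visible_window`, `max_extra_ratio`) are hypothetical and not taken from the paper, which only states that the masking depends on the training step.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sample:
    duration: float            # video length in seconds
    gt: Tuple[float, float]    # annotated (start, end) in seconds
    extra_relevant_sec: float  # estimated query-relevant time outside gt


def is_partially_annotated(s: Sample, max_extra_ratio: float = 0.5) -> bool:
    """Flag a sample whose unannotated relevant time is large relative
    to the annotated segment (assumed criterion and threshold)."""
    gt_len = s.gt[1] - s.gt[0]
    return gt_len <= 0 or s.extra_relevant_sec / gt_len > max_extra_ratio


def visible_window(s: Sample, step: int, total_steps: int) -> Tuple[float, float]:
    """Assumed linear easy-to-hard schedule for hard samples: at step 0 only
    the annotated segment is unmasked; by the last step the full video is."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    start = s.gt[0] * (1.0 - progress)
    end = s.gt[1] + (s.duration - s.gt[1]) * progress
    return (start, end)


sample = Sample(duration=100.0, gt=(40.0, 60.0), extra_relevant_sec=2.0)
windows = [visible_window(sample, step, 10) for step in (0, 5, 10)]
```

Under this schedule the unmasked window for the example sample grows from the annotated 40 to 60 s segment toward the full 0 to 100 s video, so the context a hard sample must search widens gradually over training.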