TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos

Xiangrui Liu, Minghao Qin, Yan Shu, Zhengyang Liang, Yang Tian, Chen Jason Zhang, Bo Zhao, Zheng Liu

TL;DR

This work defines Task-oriented Temporal Grounding (ToTG) to locate implicit task-relevant intervals in long videos, beyond explicit temporal descriptions. It presents TimeScope, a progressive coarse-to-fine grounding framework using Holistic and Detailed Video Representations, guided by chain-of-thought reasoning and streaming memory for efficiency. To support extensive evaluation, the authors introduce ToTG-Bench and ToTG-Pile, a diverse benchmark and a large CoT-annotated training corpus, respectively. Empirical results show TimeScope achieves superior grounding precision, strong generalization across benchmarks, and meaningful improvements to downstream LVU tasks, with comprehensive ablations validating the design choices. The work provides open resources to foster further research in task-oriented temporal grounding for long-video understanding.

Abstract

Identifying key temporal intervals within long videos, known as temporal grounding (TG), is important to video understanding and reasoning tasks. In this paper, we introduce a new form of the temporal grounding problem, Task-oriented Temporal Grounding (ToTG), which is driven by the requirements of downstream tasks rather than explicit time-interval descriptions. For example, a ToTG input may be "explain why the man in the video is sent to the hospital," whereas traditional TG would take an explicit temporal description such as "the moments when the man is tripped by a stone and falls to the ground." This new ToTG formulation presents significant challenges for existing TG methods, as it requires jointly performing deep task comprehension and fine-grained temporal localization within long videos. To address these challenges, we conduct a systematic set of studies. First, we construct a new benchmark, ToTG-Bench, which comprehensively evaluates ToTG performance across diverse settings. Second, we introduce a new temporal grounding method, TimeScope, which performs coarse-to-fine localization through a progressive reasoning process. Leveraging extensive supervised fine-tuning with carefully curated chain-of-thought (CoT) data from a variety of scenarios, TimeScope generalizes effectively across tasks and domains. Our evaluation demonstrates TimeScope's empirical advantages over existing baselines from three perspectives: (1) substantial improvements in grounding precision, (2) significant benefits to downstream tasks, and (3) strong generalizability across different scenarios. All models, datasets, and source code will be fully open-sourced to support future research in this area.
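
To make "grounding precision" concrete, the sketch below computes temporal intersection-over-union (IoU) between a predicted and a ground-truth interval, plus the fraction of predictions that clear an IoU threshold. This is a common way to score temporal grounding in general; the abstract does not state which metric the paper uses, so the function names, the 0.5 threshold, and the example intervals are illustrative assumptions rather than the authors' evaluation protocol.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals (assumed metric, not from the paper)."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # overlap length, 0 if disjoint
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Interval], gts: Sequence[Interval],
                  threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Hypothetical intervals: the first prediction overlaps half of its target span,
# the second misses entirely.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
print(recall_at_iou([(10.0, 20.0), (0.0, 5.0)], [(10.0, 20.0), (10.0, 20.0)]))
```

In long videos the target span is a small fraction of the total duration, so even a modest IoU threshold is demanding: a prediction that is off by a few minutes typically scores zero.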

Paper Structure

This paper contains 29 sections, 3 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: In traditional temporal grounding, the target is explicit and can be located via simple semantic matching, whereas task-oriented temporal grounding requires identifying an implicit target essential for completing the task.
  • Figure 1: Differences between ToTG and other grounding-QA. Red denotes the hint to the answer span in the question, whereas blue marks the corresponding part in the answer. Task-oriented questions contain no cue about the target span, while all other grounding-QA tasks do.
  • Figure 2: Overview of TimeScope. The input long video is processed to generate two representations: the Holistic Video Representation (HVR), which captures global context, and the Fine Video Representation (FVR), which retains detailed local information. TimeScope first performs coarse-grained reasoning using HVR to narrow the search space, and then refines the localization using FVR within the identified temporal interval to achieve precise task-oriented localization.
  • Figure 2: Statistical analysis of ToTG-Bench. (Left) Our benchmark covers distinct task types and 35 video categories. (Middle) Video duration and question center distributions. (Right) Performance of various models on ToTG-Bench.
  • Figure 3: Visualization of TimeScope.
  • ...and 5 more figures
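
The coarse-to-fine flow described in the TimeScope overview caption can be illustrated with a toy two-stage search. This is only a sketch under strong assumptions: a hypothetical `relevance(t)` function stands in for the model's reasoning, sparse midpoint sampling stands in for the holistic representation, and dense sampling inside the chosen window stands in for the fine representation. None of these names or parameters come from the paper, and the actual method relies on learned video representations and chain-of-thought reasoning rather than a scalar score.

```python
from typing import Callable, Tuple

Interval = Tuple[float, float]
Relevance = Callable[[float], float]  # hypothetical: timestamp in seconds -> score in [0, 1]


def coarse_stage(duration: float, relevance: Relevance, window: float) -> Interval:
    """Narrow the search space: pick the window whose midpoint looks most relevant."""
    starts = []
    t = 0.0
    while t < duration:
        starts.append(t)
        t += window
    best = max(starts, key=lambda s: relevance(min(s + window / 2, duration)))
    return best, min(best + window, duration)


def fine_stage(span: Interval, relevance: Relevance,
               step: float, threshold: float) -> Interval:
    """Refine inside the coarse window using dense samples."""
    start, end = span
    n = int((end - start) / step) + 1
    times = [start + i * step for i in range(n)]
    hits = [t for t in times if relevance(t) >= threshold]
    if not hits:
        return span  # nothing clears the threshold: fall back to the coarse window
    return hits[0], hits[-1]


def coarse_to_fine(duration: float, relevance: Relevance, window: float = 60.0,
                   step: float = 1.0, threshold: float = 0.75) -> Interval:
    return fine_stage(coarse_stage(duration, relevance, window), relevance, step, threshold)


# Toy relevance signal peaking at t = 137 s in a 10-minute video (made-up data).
def toy_relevance(t: float) -> float:
    return max(0.0, 1.0 - abs(t - 137.0) / 40.0)


print(coarse_to_fine(600.0, toy_relevance))
```

The design trade-off is visible even in this toy: the coarse pass inspects only ten timestamps for the whole video, and the dense pass is confined to a single 60-second window. The cost is that an event much shorter than the coarse window can be missed if no coarse sample lands near it.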