TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos

Chen-Lin Zhang, Lin Sui, Shuming Liu, Fangzhou Mu, Zhangcheng Wang, Bernard Ghanem

TL;DR

TimeLoc is a unified end-to-end framework for precise timestamp localization in long videos. A single one-stage, anchor-free localization model covers temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection (GEBD). Temporal chunking and temporal gradient checkpointing make end-to-end training feasible on videos with tens of thousands of frames, and a multi-stage training regime improves text-conditioned localization by fine-tuning the text encoder first and the video backbone second. Trained end to end rather than on frozen features, TimeLoc reports state-of-the-art results on multiple benchmarks, with gains in mAP, Recall@k, and Rel.Dis, while remaining efficient. The results suggest that one framework can generalize across localization tasks, which matters for scalable long-video analysis systems.

Abstract

Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at https://github.com/sming256/TimeLoc.
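
The abstract does not spell out how temporal chunking works, so the following is only a minimal sketch of the general idea: a long frame sequence is split into fixed-length spans that an encoder could process one at a time. The function name chunk_spans and the chunk length of 512 frames are assumptions made for illustration and are not taken from the paper; the sketch covers index splitting only, not the video encoder or gradient checkpointing.

```python
from typing import List, Tuple


def chunk_spans(num_frames: int, chunk_len: int) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into consecutive half-open spans of at most chunk_len frames.

    Hypothetical helper: the name and the chunk length are illustrative assumptions.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    # Each span starts at a multiple of chunk_len; the last one is clipped to num_frames.
    return [
        (start, min(start + chunk_len, num_frames))
        for start in range(0, num_frames, chunk_len)
    ]


# A 30,000-frame video, matching the "over 30k frames" scale mentioned in the abstract.
spans = chunk_spans(30_000, 512)
print(len(spans), spans[0], spans[-1])
```

Because the spans are contiguous and non-overlapping, per-chunk features can be concatenated in order to recover one feature sequence for the whole video. Whether TimeLoc uses overlapping or non-overlapping chunks is not stated in this summary.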

Paper Structure

This paper contains 25 sections, 1 equation, 2 figures, 15 tables.

Figures (2)

  • Figure 1: TimeLoc achieves state-of-the-art performance across various temporal localization tasks, including temporal action localization, temporal sentence grounding, moment retrieval, and generic event boundary detection.
  • Figure 2: Pipeline and Multi-Stage Training Strategy of TimeLoc. (a) The left diagram illustrates our pipeline, which supports temporal localization tasks with or without additional text conditions. The pipeline first extracts features from each modality, applies fusion strategies (if applicable), and then utilizes a lightweight localization head to generate predictions. On the right, the multi-stage training strategy is shown. The training process is divided into two stages: (b) in Stage 1, only the text encoder, fusion module, and localization head are trained. (c) In Stage 2, the text encoder is frozen while the video encoder, fusion module, and localization head are trained.
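
The two-stage schedule described in the Figure 2 caption can be written down as a small lookup table of which modules receive gradient updates in each stage. This is a hedged sketch rather than the authors' code: the module names (video_encoder, text_encoder, fusion_module, localization_head) and the helper trainable_modules are hypothetical labels chosen to mirror the caption.

```python
from typing import Dict, List

# True means the module is trained in that stage, False means it is frozen.
# Stage contents follow the Figure 2 caption; the key names are illustrative assumptions.
STAGES: Dict[int, Dict[str, bool]] = {
    1: {
        "video_encoder": False,   # not trained in Stage 1
        "text_encoder": True,
        "fusion_module": True,
        "localization_head": True,
    },
    2: {
        "video_encoder": True,
        "text_encoder": False,    # frozen in Stage 2
        "fusion_module": True,
        "localization_head": True,
    },
}


def trainable_modules(stage: int) -> List[str]:
    """Return the sorted names of the modules trained in the given stage."""
    return sorted(name for name, trained in STAGES[stage].items() if trained)


for stage in (1, 2):
    print(stage, trainable_modules(stage))
```

In a deep-learning framework, such a table would typically drive which parameters have gradients enabled before each stage begins. The fusion module and localization head are trained in both stages, so only the two encoders switch roles.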