OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

Shuming Liu, Chen Zhao, Fatimah Zohra, Mattia Soldan, Alejandro Pardo, Mengmeng Xu, Lama Alssum, Merey Ramazanova, Juan León Alcázar, Anthony Cioppa, Silvio Giancola, Carlos Hinojosa, Bernard Ghanem

TL;DR

OpenTAD addresses the lack of a standardized benchmarking framework for Temporal Action Detection (TAD) by introducing a unified PyTorch-based codebase that consolidates 16 methods and 9 datasets. The framework modularizes the detection pipeline into three stages: Stage 0 feature extraction, Stage 1 temporal aggregation with initial predictions, and an optional Stage 2 RoI-based refinement, supporting both feature-based and end-to-end training through a plug-and-play architecture. Through extensive ablations on neck designs, RoI strategies, backbone choices, data processing, and Stage 2 usage, the study identifies key design choices that most influence performance and demonstrates state-of-the-art results on popular benchmarks such as ActivityNet-v1.3 and THUMOS-14. The authors also release code and pretrained models to enable reproducible, fair comparisons and accelerate future research in video understanding.

Abstract

Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field has achieved remarkable progress in recent years, further progress and real-world applications are impeded by the absence of a standardized framework. Currently, different methods are compared under different implementation settings, evaluation protocols, etc., making it difficult to assess the real effectiveness of a specific technique. To address this issue, we propose OpenTAD, a unified TAD framework consolidating 16 different TAD methods and 9 standard datasets into a modular codebase. In OpenTAD, minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two. OpenTAD also facilitates straightforward benchmarking across various datasets and enables fair and in-depth comparisons among different methods. With OpenTAD, we comprehensively study how innovations in different network components affect detection performance and identify the most effective design choices through extensive experiments. This study has led to a new state-of-the-art TAD method built upon existing techniques for each component. We have made our code and models available at https://github.com/sming256/OpenTAD.

Paper Structure

This paper contains 22 sections, 1 figure, 14 tables.

Figures (1)

  • Figure 1: Unified TAD Pipeline. Recent TAD methods follow this three-step framework to predict action classes and start/end timestamps from input videos. 1) Stage 0: Videos are encoded into features using a pretrained video backbone, which may be either fine-tuned or frozen during training. 2) Stage 1: This stage consists of a neck for temporal aggregation of snippet-level features and a dense head that generates snippet-level predictions. 3) Stage 2 (optional): This stage further refines action segment proposals using RoI extraction and produces per-proposal predictions.
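
The staged pipeline in Figure 1 can be illustrated with a small, dependency-free Python sketch. This is not OpenTAD's actual API: every name below (frozen_backbone, smoothing_neck, dense_head, roi_refine, Detector) is hypothetical, and each stage is a toy stand-in (frame averaging, a 3-tap temporal average, mean-activation scoring) chosen only to show how the stages compose and how Stage 2 can be switched on or off without touching Stages 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Features = List[List[float]]           # T snippets x C channels
Proposal = Tuple[int, int, float]      # (start, end, score), in snippet units


def frozen_backbone(video: Features, snippet_len: int = 4) -> Features:
    """Stage 0 stand-in: average every `snippet_len` frames into one snippet feature."""
    feats = []
    for i in range(0, len(video) - snippet_len + 1, snippet_len):
        chunk = video[i:i + snippet_len]
        feats.append([sum(col) / snippet_len for col in zip(*chunk)])
    return feats


def smoothing_neck(feats: Features) -> Features:
    """Stage 1 neck stand-in: 3-tap temporal average with clamped edges."""
    last = len(feats) - 1
    out = []
    for t in range(len(feats)):
        window = [feats[max(t - 1, 0)], feats[t], feats[min(t + 1, last)]]
        out.append([sum(col) / 3 for col in zip(*window)])
    return out


def dense_head(feats: Features) -> List[Proposal]:
    """Stage 1 head stand-in: one proposal per snippet, scored by mean activation."""
    return [(t, t + 1, sum(f) / len(f)) for t, f in enumerate(feats)]


def roi_refine(feats: Features, proposals: List[Proposal]) -> List[Proposal]:
    """Stage 2 stand-in: pool features inside each proposal and blend the scores."""
    refined = []
    for start, end, score in proposals:
        roi = feats[start:end]
        pooled = sum(sum(f) / len(f) for f in roi) / len(roi)
        refined.append((start, end, 0.5 * (score + pooled)))
    return refined


@dataclass
class Detector:
    """Hypothetical container: any stage can be swapped; Stage 2 is optional."""
    backbone: Callable[[Features], Features]
    neck: Callable[[Features], Features]
    head: Callable[[Features], List[Proposal]]
    roi_head: Optional[Callable[[Features, List[Proposal]], List[Proposal]]] = None

    def __call__(self, video: Features) -> List[Proposal]:
        feats = self.neck(self.backbone(video))       # Stage 0, then Stage 1 neck
        proposals = self.head(feats)                  # Stage 1 dense predictions
        if self.roi_head is not None:                 # Stage 2 (optional)
            proposals = self.roi_head(feats, proposals)
        return proposals


# Toy input: 8 frames with 2 channels; the second half is "active".
video = [[0.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
one_stage = Detector(frozen_backbone, smoothing_neck, dense_head)
two_stage = Detector(frozen_backbone, smoothing_neck, dense_head, roi_refine)
print(one_stage(video))
print(two_stage(video))
```

Treating each stage as an interchangeable callable mirrors the plug-and-play idea described in the abstract: replacing the neck or adding the RoI stage changes one constructor argument and leaves the rest of the pipeline unchanged.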