Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions

Runhao Zeng, Xiaoyong Chen, Jiaming Liang, Huisi Wu, Guangzhong Cao, Yong Guo

TL;DR

This work studies the robustness of temporal action detection (TAD) models to temporal corruptions that disrupt frame-wise continuity in untrimmed videos. It introduces two benchmarks, THUMOS14-C and ActivityNet-v1.3-C, each with five corruption types at three severity levels, and proposes FrameDrop augmentation plus a Temporal-Robust Consistency (TRC) loss to defend against these corruptions. The study finds that current TAD methods are particularly vulnerable to temporal corruptions, primarily because of localization errors, and that corruptions at the center of an action instance cause the largest performance drop. A simple training strategy that combines FrameDrop and TRC improves robustness across models and datasets and can even improve clean-data performance. The authors advocate robust evaluation as a standard practice in TAD research.

Abstract

Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance, TAD models tend to yield the largest performance drop. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, through the FrameDrop augmentation and Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.

Paper Structure (28 sections, 1 equation, 9 figures, 14 tables)

Figures (9)

  • Figure 1: The mAP gap of temporal action detection methods between clean and corrupted test videos. * and # denote video features extracted by I3D and VideoMAEv2, respectively. The other methods are trained end-to-end. Existing TAD methods incur a significant mAP drop of more than 1.08% on the THUMOS14 dataset even when only one frame in an action instance is corrupted, highlighting a prevailing lack of robustness to temporal corruptions.
  • Figure 2: The gain of mAP and relative robustness brought by our proposed training strategy. * and # denote the video features extracted by I3D and VideoMAEv2, respectively. Our method enhances TAD models' robustness on corrupted videos and, surprisingly, boosts their performance on clean videos.
  • Figure 3: Our temporal robustness study introduces 5 types of temporal corruption that are frequently encountered in real-world scenarios: black frame [choi2015automated], motion blur, overexposure, occlusion, and packet loss [yi2021benchmarking]. Each corruption type has 3 levels of severity, where each level corrupts $l\%$ ($l \in \{1,5,10\}$) of the frames at the center of an action instance, resulting in 15 distinct corruptions in total.
  • Figure 4: False positive profiling of TriDet's predictions on THUMOS14-C. The Wrong Label (classification) Error remains relatively consistent, whereas the Localization Error increases significantly on corrupted data, revealing that the vulnerability mainly comes from localization error rather than classification error.
  • Figure 5: The performance of TAD models with varying corruption locations within an action instance on THUMOS14-C. The horizontal dashed lines indicate each model's performance on clean videos. As corruptions approach the center of the action, their impact on the model becomes increasingly significant.
  • ...and 4 more figures
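
The corruption scheme described in the Figure 3 caption (corrupting $l\%$ of the frames at the center of an action instance) can be illustrated with a minimal sketch for the black-frame case. This is not the authors' implementation: the function names, the frame-index convention (half-open interval `[start, end)`), the rounding rule, and the choice to always corrupt at least one frame are all assumptions made for illustration.

```python
import numpy as np


def center_frame_indices(start, end, percent):
    """Return indices of the central `percent`% of frames in [start, end).

    Assumption: the count is rounded to the nearest integer and at least
    one frame is always selected (hypothetical rule, not from the paper).
    """
    length = end - start
    if length <= 0:
        return np.arange(0)
    n = max(1, int(round(length * percent / 100.0)))
    first = start + (length - n) // 2
    return np.arange(first, first + n)


def corrupt_black_frames(video, start, end, percent):
    """Return a copy of `video` with the central frames of one action
    instance replaced by black (all-zero) frames.

    `video` is assumed to be an array of shape (T, H, W, C).
    """
    corrupted = video.copy()
    idx = center_frame_indices(start, end, percent)
    corrupted[idx] = 0  # black-frame corruption
    return corrupted


if __name__ == "__main__":
    # Toy video: 200 frames of 4x4 RGB, with one action in frames 50..149.
    video = np.ones((200, 4, 4, 3), dtype=np.uint8)
    for percent in (1, 5, 10):  # the three severity levels from Figure 3
        out = corrupt_black_frames(video, 50, 150, percent)
        n_black = int((out.reshape(200, -1).sum(axis=1) == 0).sum())
        print(f"l={percent}%: {n_black} black frame(s)")
```

Other corruption types listed in the caption (motion blur, overexposure, occlusion, packet loss) would replace only the assignment inside `corrupt_black_frames`; the center-frame selection stays the same under these assumptions.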