Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions
Runhao Zeng, Xiaoyong Chen, Jiaming Liang, Huisi Wu, Guangzhong Cao, Yong Guo
TL;DR
This work studies the robustness of temporal action detection (TAD) models to temporal corruptions that disrupt frame-wise continuity in untrimmed videos. It introduces two benchmarks, THUMOS14-C and ActivityNet-v1.3-C, with five corruption types at three severity levels, and proposes FrameDrop augmentation together with a Temporal-Robust Consistency (TRC) loss as a defense. The study finds that current TAD methods are particularly vulnerable to temporal corruptions, primarily because of localization errors, and that corruptions in the middle of an action instance cause the largest performance drop. A simple training strategy combining FrameDrop and TRC improves robustness across models and datasets and can even improve clean-data performance. The authors advocate robustness evaluation as standard practice in TAD research.
Abstract
Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can occasionally be corrupted, for example by missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) TAD models tend to suffer the largest performance drop when corruptions occur in the middle of an action instance. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, using the FrameDrop augmentation and a Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.
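To make the notion of a temporal corruption concrete, the following is a minimal sketch of corrupting a few frames at the center of an annotated action instance, the position the findings above identify as most harmful. It is not the benchmark's implementation: the function name `corrupt_center_frames`, its parameters, and the two modes ("black" as a stand-in for a missing frame, "freeze" as a repeated frame) are illustrative assumptions, and the benchmark's actual five corruption types and severity levels are not reproduced here.

```python
import numpy as np


def corrupt_center_frames(video, start, end, num_frames=1, mode="black"):
    """Corrupt `num_frames` frames around the center of the action [start, end).

    Hypothetical helper for illustration only.
    video: array of shape (T, H, W, C); start/end: frame indices of one action.
    Returns the corrupted copy and the (lo, hi) frame range that was altered.
    """
    if not 0 <= start < end <= len(video):
        raise ValueError("action span must lie inside the video")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")

    corrupted = video.copy()              # leave the clean video untouched
    center = (start + end) // 2
    lo = max(start, center - num_frames // 2)
    hi = min(end, lo + num_frames)        # never spill past the action span

    if mode == "black":
        # Assumed stand-in for a missing frame: replace it with zeros.
        corrupted[lo:hi] = 0
    elif mode == "freeze":
        # Assumed stand-in for a stalled stream: repeat the last clean frame.
        corrupted[lo:hi] = video[max(lo - 1, 0)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return corrupted, (lo, hi)


# Usage: a 10-frame toy video whose frame i is filled with the value i + 1.
video = np.stack([np.full((2, 2, 3), i + 1, dtype=np.uint8) for i in range(10)])
single, span_single = corrupt_center_frames(video, start=2, end=8)
frozen, span_frozen = corrupt_center_frames(video, 2, 8, num_frames=3, mode="freeze")
```

Comparing a detector's predictions on `video` and on `single` would be one way to probe the single-frame sensitivity described above; varying `num_frames` is only a rough, assumed proxy for severity.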
