The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024
Yinan Han, Qingyuan Jiang, Hongming Mei, Yang Yang, Jinhui Tang
TL;DR
The paper tackles Temporal Action Localisation (TAL) in untrimmed videos within the Perception Test Challenge 2024. It presents a multimodal TAL pipeline that uses state-of-the-art video feature extractors (UMT and VideoMAEv2) and audio feature extractors (BEATs and CAV-MAE), augmented with overlapping labels from Something-SomethingV2 to improve generalization. The approach trains both multimodal and unimodal models and fuses their predictions with Weighted Box Fusion to enhance localisation robustness. The method achieves first place with a reported score of 0.5498, demonstrating the effectiveness of data augmentation and multimodal fusion for TAL in real-world perception tasks.
Abstract
This report presents our method for Temporal Action Localisation (TAL), the task of identifying and classifying actions within specific time intervals of a video sequence. We expand the training dataset using overlapping labels from the Something-SomethingV2 dataset, a form of data augmentation that enhances the model's ability to generalize across action classes. For feature extraction, we use state-of-the-art models: UMT and VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features. We train both multimodal (video and audio) and unimodal (video only) models, then combine their predictions using Weighted Box Fusion (WBF) to make action localisation more robust. Our overall approach achieves a score of 0.5498, securing first place in the competition.
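To illustrate the fusion step, the following is a minimal sketch of how Weighted Box Fusion might be adapted to one-dimensional temporal segments. It is not the authors' implementation. The `Segment` type, the function names, the IoU threshold of 0.5, and the per-model weights are all assumptions introduced for illustration. The sketch follows the general WBF idea: predictions from all models are pooled, overlapping segments of the same class are clustered, each cluster is replaced by a confidence-weighted average of its boundaries, and the fused score is scaled down when fewer models agree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical container for one predicted action segment."""
    start: float   # seconds
    end: float     # seconds
    score: float   # confidence
    label: str     # action class


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def _fuse(cluster: List[Segment]) -> Segment:
    """Confidence-weighted average of boundaries; mean confidence as score."""
    total = sum(s.score for s in cluster)
    start = sum(s.start * s.score for s in cluster) / total
    end = sum(s.end * s.score for s in cluster) / total
    return Segment(start, end, total / len(cluster), cluster[0].label)


def weighted_segment_fusion(
    predictions: List[List[Segment]],
    iou_thr: float = 0.5,                          # assumed threshold
    model_weights: Optional[List[float]] = None,   # assumed per-model weights
) -> List[Segment]:
    """Fuse per-model segment lists (one inner list per model)."""
    n_models = len(predictions)
    weights = model_weights or [1.0] * n_models

    # Pool all predictions, scaling each score by its model weight.
    pool = [
        Segment(s.start, s.end, s.score * w, s.label)
        for segs, w in zip(predictions, weights)
        for s in segs
        if s.score * w > 0  # zero-score segments cannot be weighted
    ]
    pool.sort(key=lambda s: s.score, reverse=True)

    clusters: List[List[Segment]] = []
    fused: List[Segment] = []
    for seg in pool:
        # Find the best-matching fused segment of the same class.
        best_idx, best_iou = -1, iou_thr
        for i, f in enumerate(fused):
            if f.label != seg.label:
                continue
            iou = temporal_iou(f, seg)
            if iou > best_iou:
                best_idx, best_iou = i, iou
        if best_idx < 0:
            clusters.append([seg])
            fused.append(_fuse([seg]))
        else:
            clusters[best_idx].append(seg)
            fused[best_idx] = _fuse(clusters[best_idx])

    # Down-weight clusters that only a subset of models supports.
    results = [
        Segment(f.start, f.end,
                f.score * min(len(c), n_models) / n_models, f.label)
        for c, f in zip(clusters, fused)
    ]
    results.sort(key=lambda s: s.score, reverse=True)
    return results


if __name__ == "__main__":
    # Made-up predictions standing in for a multimodal and a unimodal model.
    multimodal = [Segment(1.0, 3.0, 0.9, "pour")]
    video_only = [Segment(1.2, 3.4, 0.6, "pour"),
                  Segment(10.0, 12.0, 0.8, "open")]
    for seg in weighted_segment_fusion([multimodal, video_only]):
        print(seg)
```

Unlike non-maximum suppression, which discards lower-scoring overlapping predictions, this style of fusion keeps every prediction and lets each one contribute to the final boundaries. That is one plausible reason an ensemble of multimodal and unimodal models could benefit from it.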
