Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024

Haowei Gu; Weihao Zhu; Yang Yang

TL;DR

The paper tackles temporal sound localisation (TSL) in video by shifting emphasis toward audio information. It proposes a multimodal pipeline that fuses video representations from UMT-Large and VideoMAE-Large with richer audio features from BEATs and CAV-MAE variants; the streams are aligned in time by interpolation and combined through early fusion. The localization backbone is ActionFormer, which operates on a multi-scale transformer feature pyramid, and predictions are refined in post-processing with a modified Weighted Boxes Fusion. On the track 3 dataset of the ECCV Second Perception Test Challenge, the method achieves a test mAP of 0.4925 and ranks first, indicating that targeted enhancement of audio features can substantially improve TSL performance on real-world videos.
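The alignment and early-fusion step mentioned above can be illustrated with a minimal sketch. It assumes each extractor yields a sequence of per-time-step feature vectors, that linear interpolation is used to resample the audio streams to the video length, and that early fusion means channel-wise concatenation; the function names `interpolate_features` and `early_fusion` are hypothetical and not taken from the authors' code.

```python
from typing import List

Feature = List[List[float]]  # layout: [time steps][channels]


def interpolate_features(feats: Feature, target_len: int) -> Feature:
    """Linearly resample a [T, C] feature sequence to target_len time steps."""
    src_len = len(feats)
    if src_len == 0 or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    if src_len == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(feats[0]) for _ in range(target_len)]
    scale = (src_len - 1) / (target_len - 1)  # endpoints map onto endpoints
    out: Feature = []
    for t in range(target_len):
        pos = t * scale
        lo = min(int(pos), src_len - 1)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo  # interpolation weight toward the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(feats[lo], feats[hi])])
    return out


def early_fusion(video: Feature, audio_streams: List[Feature]) -> Feature:
    """Resample each audio stream to the video length, then concatenate channels."""
    target_len = len(video)
    aligned = [interpolate_features(a, target_len) for a in audio_streams]
    fused: Feature = []
    for t in range(target_len):
        row = list(video[t])
        for stream in aligned:
            row.extend(stream[t])
        fused.append(row)
    return fused


if __name__ == "__main__":
    video = [[1.0], [2.0], [3.0]]   # 3 time steps, 1 channel (toy values)
    audio = [[0.0], [10.0]]         # 2 time steps, 1 channel (toy values)
    print(early_fusion(video, [audio]))
```

In practice the same idea would be applied to tensors with a library interpolation routine; the pure-Python version only makes the indexing explicit.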

Abstract

This report proposes an improved method for the Temporal Sound Localisation (TSL) task, which localizes and classifies the sound events occurring in a video according to a predefined set of sound classes. The champion solution from last year's first competition explored TSL by fusing the audio and video modalities with equal weight. Because the TSL task aims to localize sound events, we conducted experiments that demonstrate the superiority of sound features (Section 3). Based on these findings, to enhance the audio modality features, we employ various models to extract audio features, such as InterVideo, CAV-MAE, and VideoMAE. Our approach ranks first in the final test with a score of 0.4925.

Paper Structure (9 sections, 1 figure, 1 table)

This paper contains 9 sections, 1 figure, 1 table.

Introduction
Method
Preliminary
Overall Architecture
Multimodal Feature Extraction
Temporal Sound Localization
Post-Processing
Experiment
Conclusion

Figures (1)

Figure 1: This network uses a multimodal fusion approach for action detection. Video features are extracted with the UMT model, while audio features are generated by the BEATs model and two variants of CAV-MAE, fine-tuned on AudioSet and VGGSound, respectively. The audio outputs are concatenated to form a comprehensive audio representation. The fused video and audio features are then processed by the ActionFormer network, and the final action detection results are refined in a post-processing step using Weighted Boxes Fusion.
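To make the post-processing step more concrete, here is a minimal sketch of a one-dimensional analogue of Weighted Boxes Fusion for temporal segments. It is not the authors' modified version, whose details are not given here: the clustering rule, the `iou_thr` default, and the names `temporal_iou`, `merge_cluster`, and `fuse_segments` are assumptions for illustration. Segments are assumed to belong to a single sound class, so a real pipeline would run this once per class.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float]  # (start, end, score)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def merge_cluster(cluster: List[Segment]) -> Segment:
    """Score-weighted average of boundaries; mean score as the fused confidence."""
    total = sum(s[2] for s in cluster)
    if total <= 0:
        return cluster[0]  # degenerate case: no usable weights
    start = sum(s[0] * s[2] for s in cluster) / total
    end = sum(s[1] * s[2] for s in cluster) / total
    return (start, end, total / len(cluster))


def fuse_segments(segments: List[Segment], iou_thr: float = 0.5) -> List[Segment]:
    """Greedily cluster segments by IoU with the running fused segment, then merge."""
    clusters: List[List[Segment]] = []
    fused: List[Segment] = []
    # Visit high-confidence predictions first so they seed the clusters.
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        best, best_iou = -1, iou_thr
        for i, f in enumerate(fused):
            iou = temporal_iou(seg, f)
            if iou > best_iou:
                best, best_iou = i, iou
        if best < 0:
            clusters.append([seg])
            fused.append(seg)
        else:
            clusters[best].append(seg)
            fused[best] = merge_cluster(clusters[best])
    return fused


if __name__ == "__main__":
    preds = [(0.0, 10.0, 0.5), (0.0, 10.0, 0.5), (20.0, 30.0, 0.25)]
    print(fuse_segments(preds))
```

Unlike non-maximum suppression, which discards overlapping predictions, this kind of fusion averages them, so several models' boundary estimates all contribute to the final segment.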
