
Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024

Haowei Gu, Weihao Zhu, Yang Yang

TL;DR

The paper tackles temporal sound localisation in video by shifting emphasis toward audio information. It proposes a multimodal pipeline that fuses video representations from UMT-Large and VideoMAE-Large with richer audio cues from BEATs and CAV-MAE variants; the feature streams are aligned in time via interpolation and combined through early fusion. The localisation backbone is ActionFormer, which operates on a multi-scale transformer feature pyramid, and predictions are refined in post-processing with a modified Weighted Boxes Fusion. On the Track 3 dataset of the ECCV Second Perception Test Challenge, the method achieves a test mAP of 0.4925 and ranks first, indicating that targeted enhancement of audio features can significantly improve TSL performance in real-world videos.
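The alignment and early-fusion step can be illustrated with a minimal sketch. The summary does not say which interpolation scheme is used, so linear interpolation is assumed here; the names `resample` and `early_fuse` are hypothetical and are not taken from the authors' code.

```python
from typing import List

Sequence = List[List[float]]  # T time steps, each a D-dimensional feature vector


def resample(features: Sequence, target_len: int) -> Sequence:
    """Linearly interpolate a (T, D) sequence to (target_len, D).

    Assumption: linear interpolation with aligned endpoints; the source
    only states that streams are "aligned via interpolation".
    """
    src_len = len(features)
    if src_len == 0 or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    if src_len == 1:
        return [list(features[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        # Map output index i onto the source time axis.
        pos = i * (src_len - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(features[lo], features[hi])])
    return out


def early_fuse(streams: List[Sequence], target_len: int) -> Sequence:
    """Resample every stream to target_len, then concatenate per time step."""
    aligned = [resample(s, target_len) for s in streams]
    return [sum((stream[t] for stream in aligned), []) for t in range(target_len)]


# Toy example: a 3-step video stream (D=1) and a 2-step audio stream (D=2).
video = [[0.0], [2.0], [4.0]]
audio = [[1.0, 1.0], [3.0, 3.0]]
fused = early_fuse([video, audio], target_len=3)
```

In this toy example each fused time step carries the video feature followed by the interpolated audio features, so a downstream localisation model sees a single sequence of length `target_len`.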

Abstract

This report proposes an improved method for the Temporal Sound Localisation (TSL) task, which localizes and classifies the sound events occurring in a video according to a predefined set of sound classes. The winning solution from last year's first competition approached TSL by fusing the audio and video modalities with equal weight. Because the TSL task aims to localize sound events, we conducted experiments that demonstrate the superiority of sound features (Section 3). Based on these findings, to enhance the audio modality features, we employ various models to extract audio features, such as InternVideo, CAV-MAE, and VideoMAE. Our approach ranks first in the final test with a score of 0.4925.

Paper Structure

This paper contains 9 sections, 1 figure, and 1 table.

Figures (1)

  • Figure 1: The network uses a multimodal fusion approach for action detection. Video features are extracted using the UMT model, while audio features are generated by the BEATs model and two variants of CAV-MAE, fine-tuned on AudioSet and VGGSound, respectively. The audio outputs are concatenated to form a comprehensive audio representation. The fused video and audio features are then processed by the ActionFormer network, and the final detection results are refined in a post-processing step using Weighted Boxes Fusion.
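
The post-processing idea in the caption can be sketched for temporal segments. This is a plain one-dimensional analogue of Weighted Boxes Fusion for a single class, not the authors' modified variant, whose details are not given here. The function names, the tIoU threshold of 0.5, and the choice of the mean score as the fused confidence are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float]  # (start_sec, end_sec, score)


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster: List[Segment]) -> Segment:
    """Score-weighted average of boundaries; fused score is the mean score."""
    scores = [s for _, _, s in cluster]
    weights = scores
    if sum(weights) <= 0:
        # Fall back to equal weights if all scores are zero.
        weights = [1.0] * len(cluster)
    total = sum(weights)
    start = sum(seg[0] * w for seg, w in zip(cluster, weights)) / total
    end = sum(seg[1] * w for seg, w in zip(cluster, weights)) / total
    return (start, end, sum(scores) / len(cluster))


def weighted_segment_fusion(segments: List[Segment], thr: float = 0.5) -> List[Segment]:
    """Greedy clustering by tIoU, then weighted fusion within each cluster."""
    clusters: List[List[Segment]] = []
    fused: List[Segment] = []
    # Visit predictions from most to least confident.
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        for i, rep in enumerate(fused):
            if tiou(seg, rep) >= thr:
                clusters[i].append(seg)
                fused[i] = fuse_cluster(clusters[i])
                break
        else:
            # No existing cluster overlaps enough: start a new one.
            clusters.append([seg])
            fused.append(seg)
    return sorted(fused, key=lambda s: s[2], reverse=True)


# Toy example: two overlapping predictions of one event plus a separate event.
preds = [(1.0, 3.0, 0.75), (1.0, 5.0, 0.25), (10.0, 12.0, 0.5)]
result = weighted_segment_fusion(preds)
```

Unlike non-maximum suppression, which discards the lower-scoring overlapping segment, this kind of fusion lets every overlapping prediction contribute to the final boundaries in proportion to its confidence.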