The Solution for Temporal Sound Localisation Task of ICCV 1st Perception Test Challenge 2023
Yurui Huang, Yang Yang, Shou Chen, Xiangyu Wu, Qingguo Chen, Jianfeng Lu
TL;DR
The paper tackles temporal sound localization by combining high-quality visual representations from VideoMAE V2 with audio embeddings (MMV) in an early-fusion multimodal pipeline. The fused features are processed by a multi-scale Transformer built on the ActionFormer backbone, which performs moment-level classification and boundary regression and is trained with a two-term loss. The approach reaches $\text{mAP}=0.33$ on the final test set, placing second in the ICCV 1st Perception Test Challenge. The results indicate that self-supervised visual pretraining substantially improves localization quality, and that fusing audio with visual features provides additional gains.
Abstract
In this paper, we propose a solution for improving the quality of temporal sound localization. We employ a multimodal fusion approach that combines visual and audio features. High-quality visual features are extracted with a state-of-the-art self-supervised pre-trained network, yielding efficient video feature representations. The audio features serve as complementary information that helps the model better localize the start and end of sounds. The fused features are then used to train a multi-scale Transformer. On the final test set, we achieve a mean average precision (mAP) of 0.33, the second-best performance in this track.
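As a rough illustration of what fusing visual and audio features before the Transformer can look like, the sketch below aligns two feature sequences in time and concatenates them per time step. This is only a minimal sketch under stated assumptions: the text above does not specify the fusion operator or the alignment scheme, so channel-wise concatenation, floor-index resampling, the toy dimensions, and the helper names `resample_to_length` and `early_fuse` are all hypothetical.

```python
from typing import List, Sequence

Feature = List[float]


def resample_to_length(seq: Sequence[Feature], target_len: int) -> List[Feature]:
    """Pick target_len vectors from seq by floor-index sampling (assumed alignment scheme)."""
    if not seq or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    n = len(seq)
    # Source index for output step i is floor(i * n / target_len), clamped to the last index.
    return [list(seq[min(n - 1, (i * n) // target_len)]) for i in range(target_len)]


def early_fuse(video: Sequence[Feature], audio: Sequence[Feature]) -> List[Feature]:
    """Concatenate audio features onto video features at each time step (assumed fusion operator)."""
    audio_aligned = resample_to_length(audio, len(video))
    return [list(v) + a for v, a in zip(video, audio_aligned)]


# Toy inputs: 4 video time steps with 2 channels, 2 audio time steps with 1 channel.
video_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
audio_feats = [[1.0], [2.0]]

fused = early_fuse(video_feats, audio_feats)
print(len(fused), len(fused[0]))  # time steps, fused channels
```

In this sketch the fused sequence keeps the video time resolution and has as many channels as the two inputs combined; a sequence of this shape is what a temporal backbone would then consume.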
