The Solution for Temporal Sound Localisation Task of ICCV 1st Perception Test Challenge 2023

Yurui Huang, Yang Yang, Shou Chen, Xiangyu Wu, Qingguo Chen, Jianfeng Lu

TL;DR

The paper tackles temporal sound localization by combining high-quality visual representations from VideoMAE V2 with audio embeddings (MMV) in an early-fusion multimodal pipeline. The fused features are processed by a multi-scale Transformer built on the ActionFormer backbone, which performs moment-level classification and boundary regression under a concise two-term loss. The approach achieves $\text{mAP}=0.33$ on the final test set, securing second place in the ICCV 1st Perception Test Challenge, and shows that visual pretraining substantially boosts localization quality while multimodal fusion provides additional gains. This work highlights the practical benefits of integrating self-supervised visual features with audio cues for robust temporal localization in videos.
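
For readers unfamiliar with the metric: mAP in temporal localization is conventionally computed by matching predicted segments to ground-truth segments using temporal intersection-over-union (tIoU). The sketch below shows only that overlap measure. The function name `temporal_iou` and the `(start, end)` tuple layout are illustrative assumptions, not taken from the paper or the challenge toolkit.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) pairs.

    Hypothetical helper for illustration; both segments are assumed to
    use the same time unit (for example, seconds).
    """
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union is the sum of both lengths minus the shared part.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# A prediction counts as correct at threshold t when its tIoU with a
# ground-truth segment of the same class is at least t.
overlap = temporal_iou((0.0, 2.0), (1.0, 3.0))
```

A predicted segment that overlaps half of a ground-truth segment of equal length therefore scores one third, which is why tight boundary regression matters for the reported mAP.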

Abstract

In this paper, we propose a solution for improving the quality of temporal sound localization. We employ a multimodal fusion approach that combines visual and audio features. High-quality visual features are extracted with a state-of-the-art self-supervised pre-trained network, yielding efficient video feature representations. Audio features serve as complementary information that helps the model better localize the start and end of sounds. The fused features are then fed into a multi-scale Transformer for training. On the final test set, we achieved a mean average precision (mAP) of 0.33, the second-best result in this track.
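
The fusion step can be illustrated with a minimal sketch. It assumes that fusion means concatenating the visual and audio feature vectors at each time step, and that both streams have already been aligned to the same number of time steps; the abstract does not spell out either detail. The name `early_fuse` and the toy feature sizes are hypothetical.

```python
def early_fuse(video_feats, audio_feats):
    """Concatenate per-time-step visual and audio features.

    Assumption: both inputs are lists of feature vectors with one
    vector per time step, already temporally aligned.
    """
    if len(video_feats) != len(audio_feats):
        raise ValueError("video and audio must have the same number of time steps")
    # Each fused vector is the visual vector followed by the audio vector.
    return [list(v) + list(a) for v, a in zip(video_feats, audio_feats)]


# Toy example: 2 time steps, 2-dim visual and 1-dim audio features.
fused = early_fuse([[1.0, 2.0], [3.0, 4.0]], [[0.5], [0.6]])
```

Fusing before the Transformer lets every attention layer see both modalities at once, at the cost of requiring the two feature streams to share a time axis.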

Paper Structure (8 sections, 3 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Both the visual and audio modalities are taken as input. Visual features are extracted with VideoMAE V2 and audio features with MMV; both sets of features are then fed into a multi-scale ActionFormer.
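
To make "multi-scale" concrete, here is a rough sketch of a temporal feature pyramid, in which each level halves the temporal resolution of the previous one. It assumes that averaging adjacent time steps can stand in for the learned downsampling layers of the real model; `build_pyramid` and its arguments are hypothetical names, not the authors' implementation.

```python
def build_pyramid(seq, num_levels):
    """Build a temporal pyramid from a list of feature vectors.

    Assumption: each coarser level averages non-overlapping pairs of
    adjacent time steps (stride 2). A trailing unpaired step is dropped.
    """
    levels = [seq]
    for _ in range(num_levels - 1):
        prev = levels[-1]
        if len(prev) < 2:
            break  # Cannot downsample a single time step any further.
        pooled = [
            [(a + b) / 2 for a, b in zip(prev[i], prev[i + 1])]
            for i in range(0, len(prev) - 1, 2)
        ]
        levels.append(pooled)
    return levels


# Toy example: 4 time steps of 1-dim features, 3 pyramid levels.
pyramid = build_pyramid([[0.0], [2.0], [4.0], [6.0]], 3)
```

Coarser levels cover longer time spans per step, so short and long sound events can each be detected at a level whose resolution suits their duration.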