Enhancing Sound Source Localization via False Negative Elimination

Zengjie Song, Jiangshe Zhang, Yuxi Wang, Junsong Fan, Zhaoxiang Zhang

TL;DR

This work tackles false negatives in contrastive audio-visual learning for sound source localization by introducing two complementary schemes: SSPL, a negative-free self-supervised predictive learning method, and SACL, a semantic-aware contrastive framework that compacts visual features with pseudo masks and actively detects and excludes false negatives. A predictive coding module (PCM) enhances SSPL by iteratively aligning cross-modal features, while SACL provides a calibrated contrastive objective with reliable anchors and negatives. Empirically, the approach achieves state-of-the-art localization on SoundNet-Flickr and strong results on VGG-SS, with notable generalizability to audio-visual event classification and self-supervised object detection. The findings demonstrate that reducing semantic ambiguity in negative sampling and combining negative-free and negative-aware learning yields more faithful audio-visual alignment and robust localization in real-world scenarios.

Abstract

Sound source localization aims to localize objects emitting sound in visual scenes. Recent works that obtain impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior work can lead to the false negative issue, where sounds semantically similar to the visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. This misalignment of audio and visual features can degrade performance. To address this issue, we propose a novel audio-visual learning framework that is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, while a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard, SSPL acts as a negative-free method that eliminates false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing a reliable visual anchor and reliable audio negatives for contrast. Unlike SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative for achieving the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state of the art. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks. Code and models are available at: https://github.com/zjsong/SACL.
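
To make the false negative issue concrete, the following is a minimal sketch of a visual-anchor contrastive (InfoNCE-style) loss that skips suspected false negatives. It is not the authors' implementation: the function names, the temperature `tau`, and the detection rule (drop an audio negative whose cosine similarity to the positive audio exceeds `fn_threshold`) are all assumptions made for illustration, and the feature vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_info_nce(visual, audio, tau=0.1, fn_threshold=0.9):
    """Average visual-anchor InfoNCE loss over a batch.

    visual[i] and audio[i] come from the same clip (the positive pair).
    audio[j] is skipped as a negative for anchor i when it is too similar
    to audio[i]. This audio-audio test is a hypothetical stand-in for a
    false negative detector, not the rule used in the paper.
    Returns (mean loss, list of skipped (anchor, negative) index pairs).
    """
    n = len(visual)
    losses, dropped = [], []
    for i in range(n):
        pos = math.exp(cosine(visual[i], audio[i]) / tau)
        denom = pos
        for j in range(n):
            if j == i:
                continue
            if cosine(audio[i], audio[j]) > fn_threshold:
                dropped.append((i, j))  # suspected false negative
                continue
            denom += math.exp(cosine(visual[i], audio[j]) / tau)
        losses.append(-math.log(pos / denom))
    return sum(losses) / n, dropped


# Toy batch: clips 0 and 1 contain near-identical sounds, clip 2 differs.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
audio = [[1.0, 0.1], [1.0, 0.0], [0.1, 1.0]]

loss_masked, dropped = masked_info_nce(visual, audio)
# A threshold above 1 can never be exceeded, so no negative is skipped.
loss_plain, _ = masked_info_nce(visual, audio, fn_threshold=1.1)
```

In this toy batch the audio of clips 0 and 1 is nearly identical, so each is skipped as a negative for the other. The masked loss is lower than the unmasked one because the anchor is no longer pushed away from a sound that matches it semantically.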

Paper Structure (39 sections, 15 equations, 12 figures, 8 tables, 1 algorithm)

Figures (12)

  • Figure 1: Effect of false negatives on sound localization. For two images that are visually similar but differ in local content (e.g., hand pose), (a) ambiguous localization is observed when performing contrastive learning [Senocak18] with false negatives (i.e., audio samples that share the class label of the positive). (b) Consistent localization can be obtained without sampling false negatives, but this requires class labels as guidance. By contrast, our method mitigates the false negative problem via (c) self-supervised predictive learning (SSPL), which relies only on image-audio positive pair mining; or (d) semantic-aware contrastive learning (SACL), which proceeds with more reliable multimodal features by compacting visual features while removing potential false negatives. Experiments are performed on MUSIC [Zhao18]. See Section \ref{sec:ablation_sacl_false_neg} for quantitative comparisons.
  • Figure 2: Illustration of our sound source localization method. The overall framework in (a) builds on a three-stream network: top and bottom streams for visual feature extraction and a middle stream for audio signal processing. The audio-visual correspondence for sound localization is captured by two alternative learning schemes: self-supervised predictive learning (SSPL, Section \ref{sec:sspl}) and semantic-aware contrastive learning (SACL, Section \ref{sec:sacl}). SSPL in (b) is a negative-free approach, which aims to associate the sound source with different augmented views of one image by mining image-audio pairs from the same video clip. To this end, the attention module (AM, Section \ref{sec:sspl_am}) in (c) for feature integration and the predictive coding module (PCM, Section \ref{sec:sspl_pcm}) in (d) for feature alignment are introduced. SACL in (e) is a contrastive learning method that finds reliable anchor and negative features for contrast. The anchor point is determined by compacting visual features with a pseudo mask and a similarity map (Section \ref{sec:sacl_compact_vis_feat}), while the effective negatives are selected through the false negative detection (FND, Section \ref{sec:sacl_detect_fasle_neg}) module in (f).
  • Figure 3: Example pseudo masks used in SACL. In addition to the image-computable FH masks (6th column), we also consider spatial heuristics that partition the image into $2\times2$, $4\times4$, $8\times8$ grids (3rd-5th columns, respectively). Note that a $1\times1$ grid (2nd column) is equivalent to only using the binary similarity map to compact visual features.
  • Figure 4: Success ratio with varying cIoU thresholds. Best viewed in color and by zooming in.
  • Figure 5: Qualitative comparisons. In each panel, the first column shows images with their annotations, and the remaining columns show the predicted localization of sounding objects. The attention map or similarity map produced by each method is visualized as the localization map. Note that for SoundNet-Flickr the bounding boxes are derived from multiple annotators.
  • ...and 7 more figures
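
The grid-based pseudo masks described in the Figure 3 caption can be sketched as follows. This is an illustrative assumption about how such a spatial partition might be built, not the authors' code: the function name `grid_pseudo_mask` and the row-major cell labeling are hypothetical, and the step that combines the mask with the similarity map to compact visual features is not shown.

```python
def grid_pseudo_mask(height, width, grid):
    """Return a height x width label map splitting the image into grid x grid cells.

    Cells are labeled 0 .. grid*grid - 1 in row-major order. With grid=1
    every pixel gets label 0, i.e. the mask adds no spatial partition.
    """
    mask = []
    for r in range(height):
        row_cell = r * grid // height  # index of the cell row containing r
        mask.append([row_cell * grid + (c * grid // width) for c in range(width)])
    return mask


# A 4x4 image split into a 2x2 grid yields four 2x2 regions.
for row in grid_pseudo_mask(4, 4, 2):
    print(row)
```

A $1\times1$ grid assigns a single label to the whole image, which matches the caption's remark that this case reduces to using the binary similarity map alone.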