Improving Sound Source Localization with Joint Slot Attention on Image and Audio

Inho Kim, Youngkil Song, Jicheol Park, Won Hwa Kim, Suha Kwak

TL;DR

This work tackles self-supervised sound source localization (SSL) in image-audio pairs by introducing joint slot attention (JSA), which jointly decomposes image and audio features into target and off-target slots. Target slots drive contrastive learning via InfoNCE and enable pixel-level localization through the cross-modal attention $ca^{a,v}$, while cross-modal attention matching aligns intra- and inter-modal attentions to improve feature-level correspondence. A false-negative mitigation strategy using $k$-reciprocal nearest neighbors and a slot divergence loss further enhance slot distinctiveness and learning stability, complemented by a reconstruction loss during training. Inference uses the cross-modal attention map, refined with Image-Query based Refinement (IQR), to produce spatial localization without external priors. The method achieves the best results in almost all settings on three benchmarks (Flickr-SoundNet, VGG-Sound, AVSBench) and substantially improves cross-modal retrieval, demonstrating strong multi-modal understanding beyond saliency detection and enabling robust SSL in noisy real-world data.
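
To make the contrastive objective concrete, here is a minimal sketch of an image-to-audio InfoNCE loss computed on target-slot embeddings only. The function name `info_nce`, the temperature value 0.07, the cosine-similarity scoring, and the one-directional form are illustrative assumptions, not details confirmed by this summary.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize so the dot product below is a cosine similarity
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(image_targets, audio_targets, temperature=0.07):
    """Image-to-audio InfoNCE over a batch of target-slot embeddings.

    image_targets[i] and audio_targets[i] form the positive pair; every
    other audio in the batch is treated as a negative. The temperature
    is a hypothetical value chosen for illustration.
    """
    images = [normalize(v) for v in image_targets]
    audios = [normalize(a) for a in audio_targets]
    total = 0.0
    for i, image in enumerate(images):
        logits = [dot(image, audio) / temperature for audio in audios]
        # Numerically stable log-sum-exp over all audios in the batch
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(images)
```

A batch whose target slots are correctly paired should score a much lower loss than the same batch with the audio order shuffled, which is the behavior the training signal relies on.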

Abstract

Sound source localization (SSL) is the task of locating the source of sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and audio as a single embedding vector each, and use them to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal because the input contains noise and background irrelevant to the actual target. We present a novel SSL method that addresses this chronic issue by joint slot attention on image and audio. Specifically, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only the target representations of image and audio are used for contrastive learning. We also introduce cross-modal attention matching to further align local features of image and audio. Our method achieved the best results in almost all settings on three public benchmarks for SSL, and substantially outperformed all prior work in cross-modal retrieval.
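
The competition between the two slots can be illustrated with a single simplified slot-attention iteration. This is a sketch under stated assumptions: it follows the generic slot-attention recipe (softmax over the slot axis, then a weighted mean over inputs), omits the learned projections and the recurrent slot update, and uses the hypothetical name `slot_attention_step`. It is not the authors' implementation.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention_step(slots, features, eps=1e-8):
    """One simplified slot-attention iteration (no learned layers).

    slots:    K slot vectors used as queries, each of length D
    features: N local feature vectors used as keys and values, each of length D
    Returns (new_slots, attn), where attn[n][k] is the share of
    feature n claimed by slot k.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over the slot axis: the slots compete for every feature
    attn = [
        softmax([scale * sum(f * s for f, s in zip(feature, slot)) for slot in slots])
        for feature in features
    ]
    new_slots = []
    for k in range(len(slots)):
        # Each slot becomes the attention-weighted mean of the features it won
        weights = [attn[n][k] + eps for n in range(len(features))]
        total = sum(weights)
        new_slots.append(
            [sum(w * f[j] for w, f in zip(weights, features)) / total for j in range(dim)]
        )
    return new_slots, attn
```

With two slots, each local feature splits its attention mass between them, so one slot can absorb the sounding object while the other absorbs background and noise. Which slot plays the target role is decided by training, not by this step.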

Paper Structure

This paper contains 19 sections, 16 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Difference between previous work and ours. For contrastive learning of SSL, our method identifies and utilizes pairs of relevant image regions and audio segments, while previous work exploits a local image feature and the global audio feature.
  • Figure 2: Overview of the proposed method. First, image and audio features are extracted by the image and audio encoders, and serve as keys and values for slot attention. In slot attention, two learnable slots are used as queries to draw attentions on the input features while competing with each other, decomposing the features into the target and off-target representations per modality. The model is then trained by contrastive learning using only the target representations, and by cross-modal attention matching for further feature-level alignment between the two modalities. For localization in testing, we draw the attention map between the audio target representation and image features. Some details for learning slot attention, such as decoders and loss for slot reconstruction, have been omitted for clarity.
  • Figure 3: Sound localization results on Flickr-SoundNet-Test [soundnet_neurips16] and VGG-SS [vggsound_icassp20]. (a) Input image. (b) Ground-Truth. (c) Ours. (d) Alignment [alignment_iccv23]. (e) FNAC [fnac_cvpr23]. (f) EZ-VSL [ezvsl_eccv22]. The qualitative results are obtained by the model trained on Flickr-144k and the model trained on VGGSound-144k, respectively. Note that all visualizations are obtained without refinement.
  • Figure 4: Qualitative results to show the impact of $\mathcal{L}_{\text{match}}$ on Flickr-SoundNet. (a) Input image. (b) Ground-Truth. (c) Results without cross-modal attention matching. (d) Results with cross-modal attention matching.
  • Figure 5: Qualitative results of cross-modal retrieval on VGG-SS.
  • ...and 7 more figures
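
The test-time localization step described in the Figure 2 caption, an attention map between the audio target representation and the image features, can be sketched as follows. The cosine-similarity scoring, the min-max normalization, and the name `localization_map` are assumptions made for illustration; the caption does not specify how the map is scored or normalized.

```python
import math


def cosine(u, v):
    numerator = sum(a * b for a, b in zip(u, v))
    denominator = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return numerator / denominator if denominator > 0 else 0.0


def localization_map(audio_target, image_features, height, width):
    """Score each local image feature against the audio target slot.

    image_features is a row-major list of height * width local vectors.
    Returns a height x width grid of scores rescaled to [0, 1].
    """
    if len(image_features) != height * width:
        raise ValueError("expected height * width local image features")
    scores = [cosine(audio_target, feature) for feature in image_features]
    low, high = min(scores), max(scores)
    span = high - low
    # Min-max normalization (an assumed choice) so the map is easy to threshold
    scaled = [(s - low) / span if span > 0 else 0.0 for s in scores]
    return [scaled[row * width:(row + 1) * width] for row in range(height)]
```

Cells whose features point in the same direction as the audio target slot receive scores near 1 and mark the likely sound source. Upsampling the grid to the input resolution would then give a pixel-level map.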