Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

TL;DR

This work treats cross-modal interaction as a core component of sound source localization. It introduces IS3, an interactive synthetic benchmark, and two new metrics, Adaptive cIoU and IIoU, which evaluate localization together with cross-modal interaction. It proposes a cross-modal alignment framework that learns from multiple positive audio-visual samples (both hand-crafted augmentations and semantically similar samples), using dual losses computed from $s_L$ for spatial localization and $s_A$ for semantic feature alignment. The approach achieves state-of-the-art results across localization benchmarks, enables robust cross-modal retrieval, and demonstrates interactive localization on multi-source scenes and segmentation tasks, highlighting the importance of semantic alignment and multi-view supervision. Overall, the work provides a comprehensive, multi-task evaluation of audio-visual interaction and localization, with implications for practical audio-visual perception systems.

Abstract

Recent studies on learning-based sound source localization have mainly focused on the localization performance perspective. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or off-screen sounds. In this paper, we first comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. Then, we identify the limitations of previous studies and make several contributions to overcome the limitations. First, we introduce a new synthetic benchmark for interactive sound source localization. Second, we introduce new evaluation metrics to rigorously assess sound source localization methods, focusing on accurately evaluating both localization performance and cross-modal interaction ability. Third, we propose a learning framework with a cross-modal alignment strategy to enhance cross-modal interaction. Lastly, we evaluate both interactive sound source localization and auxiliary cross-modal retrieval tasks together to thoroughly assess cross-modal interaction capabilities and benchmark competing methods. Our new benchmarks and evaluation metrics reveal previously overlooked issues in sound source localization studies. Our proposed novel method, with enhanced cross-modal alignment, shows superior sound source localization performance. This work provides the most comprehensive analysis of sound source localization to date, with extensive validation of competing methods on both existing and new benchmarks using new and standard evaluation metrics.

Paper Structure

This paper contains 20 sections, 5 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: A conceptual difference between prior approaches and our alignment-based sound source localization.
  • Figure 2: Our sound source localization framework. Our model constructs multiple positive pairs using augmentation and nearest-neighbor search (semantically similar samples). Using these 9 newly constructed pairs, our model applies spatial localization, $s_L$, and semantic feature alignment, $s_A$, to each pair to learn a better sound source localization ability.
  • Figure 5: Qualitative comparison of cIoU and Adaptive cIoU in terms of the area used for quantitative analysis. (a), (b), and (c) show the audio-visual attention map, the predicted area under cIoU, and the predicted area under Adaptive cIoU, respectively. Gray denotes the background, and the ground-truth bounding box is drawn in green. Although the localization area in (b) covers the bounding box, the sample cannot be counted as correct because the prediction is much larger than the ground truth. Adaptive cIoU, by contrast, evaluates model performance more appropriately when the ground truth is small.
  • Figure 6: Qualitative sound source localization results.
  • Figure 7: Compositional image retrieval. Our method retrieves the desired images based on the given image and audio. We use simple multimodal embedding space arithmetic for compositional image retrieval. Due to the strong cross-modal alignment, our method achieves meaningful results in compositional image retrieval with straightforward vector arithmetic.
  • ...and 1 more figure
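
The "multimodal embedding space arithmetic" mentioned in the Figure 7 caption can be illustrated with a minimal sketch. The caption does not give the exact formula, so the version below is an assumption: it adds the L2-normalized image and audio embeddings and ranks gallery images by cosine similarity to the sum. The function names (`l2_normalize`, `compositional_retrieval`) and the toy 3-dimensional embeddings are hypothetical, and nothing here is taken from the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def compositional_retrieval(image_emb, audio_emb, gallery_embs, top_k=3):
    """Rank gallery images against a combined image + audio query.

    Assumption: the query is the normalized sum of the two unit-length
    embeddings. This is one simple form of embedding arithmetic, not
    necessarily the one used in the paper.
    """
    query = l2_normalize(l2_normalize(image_emb) + l2_normalize(audio_emb))
    gallery = l2_normalize(gallery_embs)
    scores = gallery @ query  # cosine similarity per gallery item
    order = np.argsort(-scores, kind="stable")[:top_k]
    return order, scores[order]


# Toy embeddings in a shared 3-d space (purely illustrative values).
image_emb = np.array([1.0, 0.0, 0.0])   # concept present in the query image
audio_emb = np.array([0.0, 1.0, 0.0])   # concept present in the query audio
gallery = np.array([
    [1.0, 0.0, 0.0],  # 0: matches the image concept only
    [0.0, 1.0, 0.0],  # 1: matches the audio concept only
    [1.0, 1.0, 0.0],  # 2: contains both concepts
    [0.0, 0.0, 1.0],  # 3: unrelated
])

indices, scores = compositional_retrieval(image_emb, audio_emb, gallery, top_k=2)
print(indices, scores)
```

In this toy setup the gallery item that contains both concepts ranks first, which is the behavior the caption attributes to a well-aligned embedding space. With real encoders, the quality of the result depends on how well the audio and image embeddings share one space.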