Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung
TL;DR
This work foregrounds cross-modal interaction as a core component of sound source localization, introducing IS3 as an interactive synthetic benchmark and two new metrics, Adaptive cIoU and IIoU, to evaluate localization alongside cross-modal interactivity. It proposes a cross-modal alignment framework that learns from multiple positive audio-visual samples (both hand-crafted augmentations and semantically similar priors), using two losses, one for localization and one for semantic alignment, via $s_L$ and $s_A$. The approach achieves state-of-the-art results across localization benchmarks, enables robust cross-modal retrieval, and demonstrates interactive localization on multi-source scenes and segmentation tasks, highlighting the importance of semantic alignment and multi-view supervision. Overall, the work provides a comprehensive, multi-task evaluation of audio-visual interaction and localization, with implications for practical audio-visual perception systems.
Abstract
Recent studies on learning-based sound source localization have mainly focused on localization performance. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or off-screen sounds. In this paper, we first comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. We then identify the limitations of previous studies and make several contributions to overcome them. First, we introduce a new synthetic benchmark for interactive sound source localization. Second, we introduce new evaluation metrics that rigorously assess sound source localization methods on both localization performance and cross-modal interaction ability. Third, we propose a learning framework with a cross-modal alignment strategy to enhance cross-modal interaction. Lastly, we evaluate interactive sound source localization and auxiliary cross-modal retrieval tasks together to thoroughly assess cross-modal interaction capabilities and benchmark competing methods. Our new benchmarks and evaluation metrics reveal previously overlooked issues in sound source localization studies. Our proposed method, with enhanced cross-modal alignment, shows superior sound source localization performance. This work provides the most comprehensive analysis of sound source localization to date, with extensive validation of competing methods on both existing and new benchmarks using new and standard evaluation metrics.
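To make the idea of learning from multiple positive audio-visual samples more concrete, the following is a minimal sketch of a generic contrastive (InfoNCE-style) loss in which each audio query may have more than one positive visual candidate, for example an augmented view and a semantically similar sample. It is an illustration under stated assumptions, not the loss used in this work: the function name `multi_positive_info_nce`, the `temperature` default, and the way positives are pooled are all hypothetical, and the sketch does not reproduce the paper's $s_L$ or $s_A$.

```python
import math


def multi_positive_info_nce(sim, pos_mask, temperature=0.07):
    """Generic multi-positive contrastive loss (illustrative, hypothetical helper).

    sim:       list of rows; sim[i][j] is the similarity between audio query i
               and visual candidate j (assumed to be a cosine similarity).
    pos_mask:  list of rows of bools; pos_mask[i][j] is True when candidate j
               is treated as a positive for query i.
    Returns the mean over queries of
        -log( sum_{j in positives} exp(s_ij / t) / sum_j exp(s_ij / t) ).
    """
    losses = []
    for row, mask in zip(sim, pos_mask):
        if not any(mask):
            raise ValueError("every query needs at least one positive")
        scaled = [s / temperature for s in row]
        # Subtract the row maximum before exponentiating for numerical stability.
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        pos = sum(e for e, is_pos in zip(exps, mask) if is_pos)
        losses.append(math.log(sum(exps)) - math.log(pos))
    return sum(losses) / len(losses)


# Four queries, four candidates, all similarities equal (an uninformative model).
uniform = [[0.0] * 4 for _ in range(4)]
one_positive = [[i == j for j in range(4)] for i in range(4)]
# Each query additionally treats the next candidate (cyclically) as a positive.
two_positives = [[j in (i, (i + 1) % 4) for j in range(4)] for i in range(4)]

loss_one = multi_positive_info_nce(uniform, one_positive)
loss_two = multi_positive_info_nce(uniform, two_positives)
```

With uninformative similarities, the loss equals the log of the ratio between the number of candidates and the number of positives per query, so adding a second positive lowers it from log 4 to log 2. Pooling positives inside the logarithm is only one design choice; averaging a separate loss term per positive is another common option.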
