Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak

TL;DR

This work presents a unified model for simultaneous grounding of mixed audio types—speech and non-speech sounds—within visual scenes using a mix-and-separate framework. It introduces a dual-head audio encoder with a shared visual backbone and two contrastive losses, ${\mathcal{L}}_{cor}$ and ${\mathcal{L}}_{dis}$, combined as ${\mathcal{L}}_{total}={\mathcal{L}}_{cor}+{\mathcal{L}}_{dis}$, to learn both cross-modal correspondences and disentanglement without reconstructing separated audio. A new Extended-IS3 dataset is created to evaluate simultaneous grounding, and extensive experiments show superior simultaneous grounding and competitive segmentation and retrieval performance compared with state-of-the-art baselines. The approach demonstrates strong disentanglement of audio types, robustness to mixed audio, and practical relevance for real-world audio-visual perception tasks.

Abstract

We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a 'mix-and-separate' framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves comparable or better performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach.
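The combined objective ${\mathcal{L}}_{total}={\mathcal{L}}_{cor}+{\mathcal{L}}_{dis}$ can be illustrated with a toy sketch. What follows is not the authors' implementation: the linear "encoders", the symmetric InfoNCE form of each loss, the pooled (non-spatial) embeddings, the additive mixing, and every name (`audio_encoder`, `info_nce`, `W_shared`, and so on) are assumptions made for illustration only. The only parts taken from the text are the dual-head audio encoder, the use of both clean and mixed audio, and the sum of a correspondence term and a disentanglement term.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(a, v, temperature=0.07):
    """Symmetric InfoNCE over a batch: row i of `a` is the positive for row i of `v`.
    Assumed loss form; the paper's exact contrastive losses may differ."""
    logits = l2_normalize(a) @ l2_normalize(v).T / temperature

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

rng = np.random.default_rng(0)
B, D_audio, D_emb = 4, 16, 8                            # toy sizes (hypothetical)

# Hypothetical dual-head audio encoder: a shared trunk and two linear heads.
W_shared = rng.normal(size=(D_audio, D_emb))
W_speech = rng.normal(size=(D_emb, D_emb))
W_sound = rng.normal(size=(D_emb, D_emb))

def audio_encoder(x):
    h = np.tanh(x @ W_shared)
    return h @ W_speech, h @ W_sound                    # (speech head, sound head)

# Stand-in inputs: random "audio features" and random "image embeddings".
speech = rng.normal(size=(B, D_audio))
sound = rng.normal(size=(B, D_audio))
mixed = speech + sound                                  # the "mix" step (assumed additive)
v_speech = rng.normal(size=(B, D_emb))                  # images paired with the speech clips
v_sound = rng.normal(size=(B, D_emb))                   # images paired with the sound clips

# Correspondence: each clean audio should match its own paired image.
s_clean, _ = audio_encoder(speech)
_, n_clean = audio_encoder(sound)
loss_cor = info_nce(s_clean, v_speech) + info_nce(n_clean, v_sound)

# Disentanglement: each head applied to the MIXED audio should still match
# the image of its own audio type, with no audio reconstruction involved.
s_mix, n_mix = audio_encoder(mixed)
loss_dis = info_nce(s_mix, v_speech) + info_nce(n_mix, v_sound)

loss_total = loss_cor + loss_dis
print(f"L_cor={loss_cor:.3f}  L_dis={loss_dis:.3f}  L_total={loss_total:.3f}")
```

With untrained random weights the printed values carry no meaning; the sketch only shows how the two terms could be wired together so that separation happens at the feature level rather than on the waveform.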

Paper Structure

This paper contains 31 sections, 17 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The pipeline of our framework. Visual and audio encoders extract features from clean audios, their paired images, and mixed audio inputs. These features are used to compute audio-visual similarities, which are then used in our correspondence and disentanglement losses. The correspondence objective ensures audio-visual matching, while the disentanglement objective enables feature-level separation of mixed audio sources.
  • Figure 2: Qualitative results for simultaneous audio-visual grounding on the Extended-IS3 dataset. Our model accurately localizes both overlapping audio types simultaneously within the mixed audio, whereas the competing method (hamilton2024separating) cannot.
  • Figure 3: Sound-prompted semantic segmentation on the dataset from hamilton2024separating.
  • Figure 4: Speech-prompted semantic segmentation on the dataset from hamilton2024separating.
  • Figure 5: Simultaneous semantic segmentation on the Extended-IS3 dataset.
  • ...and 2 more figures