Segmenting Collision Sound Sources in Egocentric Videos

Kranti Kumar Parida, Omar Emara, Hazel Doughty, Dima Damen

TL;DR

Collision Sound Source Segmentation (CS3) addresses locating and segmenting objects responsible for collision sounds in egocentric video, conditioned on audio. The authors propose a weakly supervised pipeline that fuses audio-conditioned CLIP-based segmentation with hand-object interaction priors and SAM-based collision verification, enabling accurate segmentation without explicit object masks. They introduce two benchmarks, EPIC-CS3 and Ego4D-CS3, and demonstrate substantial improvements over baselines, with the full model achieving strong mIoU and AUC scores and robust performance on small, occluded, and multi-object collisions. This work advances audio-visual reasoning in cluttered, real-world settings and provides publicly available datasets and code for further research.

Abstract

Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by $3\times$ and $4.7\times$ in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.
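The headline numbers above are reported in mIoU. As a reminder of what that metric usually measures, here is a minimal sketch assuming binary masks and a per-sample average. The function names `iou` and `mean_iou` are hypothetical, and the evaluation protocol actually used on EPIC-CS3 and Ego4D-CS3 is not stated in this summary and may differ (for example in how empty masks are handled).

```python
import numpy as np


def iou(pred, gt):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(preds, gts):
    """Average the per-sample IoU over paired predictions and ground truths."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```

Under this definition, a relative gain such as $3\times$ means the ratio between the mIoU of the proposed method and that of the baseline on the same benchmark.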

Paper Structure

This paper contains 16 sections, 7 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Task Definitions. (Top) Sound source localisation segments one object given its sound. (Middle) Multi-sound source localisation segments multiple objects from a mixture of their individually distinct sounds. (Bottom) We introduce a novel task, CS3, to segment the sources of collision sound by identifying the objects involved in the interaction, based on the impact sound.
  • Figure 2: Proposed Architecture. Our architecture consists of three main components: (1) audio-conditioned segmentation, (2) hand-object interaction (HOI) and (3) collision verification. The audio-conditioned segmentation model takes an image ($\mathbf{I}$) and its corresponding audio ($\mathbf{A}$) to produce conditioning signals $\mathbf{I}_C$ and $\mathbf{A}_C$. The audio is first encoded into a representation aligned with the text token space, which is used alongside visual features to guide the localisation of sound-producing regions. The model is trained with image-level ($\mathcal{L}_{i}$), feature-level ($\mathcal{L}_{f}$) and area regularisation ($\mathcal{L}_{r}$) losses. The HOI model provides bounding boxes for the objects held in the left and right hands, when present. The collision verification module uses SAM to extract object masks for the audio-conditioned segmentation mask $\mathbf{M}_{av}$ and the in-hand objects $\mathbf{M}_{\textrm{left}}$ and $\mathbf{M}_{\textrm{right}}$. A contact-based strategy is then applied to estimate the segmentations for collision sound sources, $\mathbf{M}_{coll.}$.
  • Figure 3: Distribution of EPIC-CS3 over the predicted sound class and noun category in the corresponding action.
  • Figure 4: Distribution of Ego4D-CS3 over the predicted sound classes and the scenario causing the sound.
  • Figure 5: Distribution of Mask Sizes by percentage of pixels occupied. Many small objects make segmentation challenging.
  • ...and 5 more figures
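
The Figure 2 caption mentions a contact-based strategy that combines $\mathbf{M}_{av}$ with the in-hand masks $\mathbf{M}_{\textrm{left}}$ and $\mathbf{M}_{\textrm{right}}$ to obtain $\mathbf{M}_{coll.}$, but this summary does not spell out the rule. The sketch below illustrates one plausible reading and is not the authors' implementation: an in-hand mask is treated as being in contact when it lies within a small pixel radius of the audio-conditioned mask, and touching masks are merged into the output. The function names, the `radius` parameter and the merge rule are all assumptions made for illustration.

```python
import numpy as np


def dilate(mask, radius=1):
    """Binary dilation with a square (2*radius+1) window, NumPy only."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    # OR together every shifted copy of the mask inside the window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def in_contact(mask_a, mask_b, radius=1):
    """Assumed contact test: mask_b comes within `radius` pixels of mask_a."""
    return bool(np.any(dilate(mask_a, radius) & np.asarray(mask_b, dtype=bool)))


def collision_mask(m_av, m_left=None, m_right=None, radius=1):
    """Hypothetical merge rule: start from M_av, add in-hand masks that touch it."""
    m_coll = np.asarray(m_av, dtype=bool).copy()
    for m_hand in (m_left, m_right):
        if m_hand is not None and in_contact(m_hand, m_av, radius):
            m_coll |= np.asarray(m_hand, dtype=bool)
    return m_coll
```

With this rule, an in-hand object far from the audio-conditioned region is ignored, and when no hand mask is available the result falls back to $\mathbf{M}_{av}$ alone. Other choices, such as requiring contact between the two in-hand objects themselves, are equally consistent with the caption.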