Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions

Mengyu Yang, Yiming Chen, Haozheng Pei, Siddhant Agarwal, Arun Balajee Vasudevan, James Hays

TL;DR

This work tackles the challenge of linking everyday object interactions in egocentric footage to the sounds they produce. It introduces the sounding object detection task and also evaluates on the existing sounding action discovery task. The method combines an object-centric, multimodal framework, initialized with a pretrained slot-attention visual encoder and guided by automatic object masks, with a three-stage training regime (align, refine, finetune) to learn cross-modal representations across video, audio, and language. Key contributions include automatic object mask annotation for large-scale training, a dedicated sounding object detection benchmark with manually annotated ground truth, and state-of-the-art results on both sounding object detection and sounding action discovery across Ego4D and Epic Kitchens, supported by ablations. The approach enables more precise localization of the sound-producing object within complex, unconstrained real-world scenes.

Abstract

Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved, which guide the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.

Paper Structure

This paper contains 38 sections, 5 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Humans handle a wide variety of objects throughout the day and many of these interactions produce sounds. We introduce a multimodal object-aware framework that learns the relationship between the objects in an interaction and the resulting sounds. This enables our model to detect the sounding objects from a set of candidates in a scene.
  • Figure 2: We also evaluate our model on sounding action discovery [chen2024soundingactions]. The left example shows a sounding action, where cutting the grass directly produces the rustling sound. Meanwhile, the right example depicts a non-sounding action where the sound comes from the video and not the action of tapping the screen.
  • Figure 3: Left: Our object-aware visual features. Given a video frame and the corresponding object segmentation mask, we first encode the image into patch embeddings. We also patchify the mask to get a per-patch objectness score, corresponding to the percentage of the patch containing the object. The score tells the model which patch embeddings to keep based on a threshold, and the remaining embeddings are pooled into a single visual embedding vector. Right: The hard negatives paradigm used in the finetuning stage. Additional negative embeddings are sampled from non-interaction regions of the same image.
  • Figure 4: Examples of annotated frames from Ego4D from our benchmark, visualizing segmentation masks (red and green) of ground truth objects.
  • Figure 5: Qualitative results of our sounding object detection task. For each sub-figure from left to right: a) the original video frame, b) the ground truth object segmentation mask, and c) the audiovisual similarity score of each object region. The object regions in c) are detected using OWLv2 [minderer2023scaling] and SAM 2 [ravi2024sam2]. Dark blue is the lowest score and dark red is the highest. Refer to the appendix for the colormap.
  • ...and 6 more figures
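
The object-aware pooling described in the Figure 3 caption (patchify the mask, score each patch by object coverage, keep patches above a threshold, pool the rest into one vector) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `patch_objectness` and `object_aware_pool`, the default threshold of 0.5, the use of mean pooling, and the fallback to all patches when none pass the threshold are all assumptions made for illustration.

```python
import numpy as np


def patch_objectness(mask: np.ndarray, patch: int) -> np.ndarray:
    """Return the fraction of each patch covered by the object mask.

    mask:  (H, W) binary array, 1 where the object is present.
    patch: side length of a square patch in pixels (hypothetical parameter).
    Output is flattened in row-major patch order, shape (H//patch * W//patch,).
    """
    h, w = mask.shape
    gh, gw = h // patch, w // patch
    # Crop to a whole number of patches, then average inside each patch.
    m = mask[: gh * patch, : gw * patch].astype(np.float32)
    return m.reshape(gh, patch, gw, patch).mean(axis=(1, 3)).reshape(-1)


def object_aware_pool(
    patch_emb: np.ndarray, scores: np.ndarray, threshold: float = 0.5
) -> np.ndarray:
    """Pool the patch embeddings whose objectness score meets the threshold.

    patch_emb: (N, D) patch embeddings from any visual encoder.
    scores:    (N,) per-patch objectness scores in [0, 1].
    Mean pooling and the all-patches fallback are assumptions of this sketch.
    """
    keep = scores >= threshold
    if not keep.any():
        # Assumed fallback: no patch is object-like, so pool the whole frame.
        keep = np.ones_like(keep, dtype=bool)
    return patch_emb[keep].mean(axis=0)


# Toy example: a 4x4 mask split into 2x2 patches.
mask = np.zeros((4, 4))
mask[0:2, 0:2] = 1  # top-left patch fully covered
mask[0, 2:4] = 1    # top-right patch half covered
scores = patch_objectness(mask, patch=2)
embeddings = np.arange(8, dtype=np.float32).reshape(4, 2)
pooled = object_aware_pool(embeddings, scores, threshold=0.5)
```

In the toy example the two top patches pass the threshold, so `pooled` is the mean of their embeddings. In the paper's setting the embeddings would come from the visual encoder rather than a toy array.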