SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Changan Chen; Kumar Ashutosh; Rohit Girdhar; David Harwath; Kristen Grauman

TL;DR

This work introduces MC3 (multimodal contrastive-consensus coding), a self-supervised embedding that jointly aligns audio, vision, and language so that it captures sounding actions in narrated egocentric video. The core idea is to first align the modalities pairwise and then refine the embedding with a consensus objective that enforces agreement across all three modalities, which enables discovery of long-tail sounding actions without audio labels. Evaluations on Ego4D and EPIC-Sounds show that MC3 improves sounding action discovery, cross-modal retrieval, and audio classification over prior two-modality and multimodal baselines. The approach relies on free-form narrations and a two-stage (align, then refine) training scheme to learn action-specific audio-visual correspondences, with potential applications in multimodal understanding and content generation.

Abstract

We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.
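The pairwise part of this idea can be illustrated with a small sketch. It is not the authors' implementation: it assumes cosine similarity, an InfoNCE-style contrastive loss, and a temperature of 0.1, and the function names (`cosine`, `info_nce`, `pairwise_contrastive_loss`) are hypothetical. Each clip contributes one embedding per modality, and matching clips are treated as positives while other clips in the batch act as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match targets[i], not targets[j != i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def pairwise_contrastive_loss(audio, video, language, temperature=0.1):
    """Sum of symmetric InfoNCE terms over the three modality pairs (assumed form)."""
    pairs = [(audio, video), (audio, language), (video, language)]
    return sum(
        info_nce(x, y, temperature) + info_nce(y, x, temperature)
        for x, y in pairs
    )
```

With toy two-dimensional embeddings, a batch in which all three modalities agree per clip yields a much lower loss than a batch in which one modality is shuffled, which is the signal the align stage relies on.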

Paper Structure (36 sections, 4 equations, 9 figures, 8 tables)

Introduction
Related Work
Action/interaction/impact sound.
Audio-visual learning.
Language+X learning.
Multi-modal/view representation learning.
Task Formulation
Multimodal Contrastive-Consensus Coding
Align-Refine Two-stage Training
Multimodal Contrastive Coding
Multimodal Consensus Coding
Implementation Details
Training and Eval Data for Sounding Actions
Dataset.
Ground truth annotations for evaluation.
...and 21 more sections

Figures (9)

Figure 1: We aim to distinguish sounds that are directly caused by human actions (bottom) from those that are not (top). Given egocentric training videos with language descriptions of the camera wearer's ("C") current action, we learn an embedding where the audio and visual features of any given clip are best aligned only when both are also consistent with the language. This allows discerning clips where the audio and vision may be correlated (e.g., the running cutting machine making loud noise in the top row) from those where the sounds are driven by human action (digging in the bottom row). Importantly, no language is needed at inference time.
Figure 2: Main idea. On the left, the Venn diagram illustrates different ways the audio ($A$), video ($V$), and language ($L$) modalities can overlap in the content they capture. C refers to the camera wearer. Regions II, III, and IV contain information that is shared by only two modalities and not the third, e.g., the racing game in ①, where the game sounds correlate with the vision yet are not about the camera wearer's described action (using hands on laptop); the lifting action in ③, where the visuals and language agree but the action is inaudible; and the off-screen talking action in ④, where talking is heard and described, but the camera wearer cannot be seen speaking. Region I is the information on which all modalities agree, e.g., the visible and audible plastering action in ②. Our model's "align" phase detects any such (dis)agreements via pairwise contrastive learning on the modalities. In the "refine" phase, we use the intersection of that agreement (region I) to refine the embedding. For example, on the right, we show what the three modality embeddings should look like after the "align" stage for examples ① and ②. Embeddings of instances where all modalities agree will be close together in the embedding space, and far apart otherwise. In other words, for example ①, yellow (video) cannot be close to blue (audio) unless green (language) is too.
Figure 3: Multimodal contrastive-consensus loss. (a): Given three modality embeddings $e_i^t$, $e_j^t$, $e_k^t$, multimodal contrastive coding pulls each pair of modalities closer while pushing modality pairs from another sample further away. (b): However, depending on the instance, the modalities may not all agree on how close they should be. Thus we set the furthest distance any feature has from the anchor feature as the consensus and push the remaining embeddings away to meet this consensus.
Figure 4: Long-tail distribution of sounding actions.
Figure 5: Sounding action discovery accuracy
...and 4 more figures
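The consensus step described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption only, not the paper's loss: it assumes cosine distance as the distance measure, and the names `cosine_distance` and `consensus_gaps` are hypothetical. For one anchor modality, the furthest distance to any other modality is taken as the consensus, and each remaining modality gets a gap telling how far it would need to be pushed away to meet that consensus.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def consensus_gaps(anchor, others):
    """Return the consensus distance and the per-modality gap to it.

    anchor: embedding of the anchor modality for one clip.
    others: dict mapping modality name -> embedding for the same clip.
    """
    distances = {name: cosine_distance(anchor, emb) for name, emb in others.items()}
    # Consensus: the furthest distance any modality has from the anchor.
    consensus = max(distances.values())
    # Gap: how much further each modality must move to match the consensus.
    gaps = {name: consensus - d for name, d in distances.items()}
    return consensus, gaps

# Toy clip resembling example ① in Figure 2: audio agrees with video,
# but language describes something else (made-up 2-D embeddings).
video = [1.0, 0.0]
consensus, gaps = consensus_gaps(video, {"audio": [1.0, 0.0], "language": [0.0, 1.0]})
```

In this toy clip the language embedding sets the consensus, so the audio embedding receives a nonzero gap even though audio and video agree with each other. That mirrors the caption's point that video and audio should not stay close unless language is close as well.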
