Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

TL;DR

This work tackles the problem of generating realistic action sounds from silent egocentric videos by disentangling foreground action sounds from ambient background. It introduces AV-LDM, an ambient-aware audio-visual latent diffusion model that conditions on both video content and an ambient audio cue drawn from the training data, enabling retrieval-augmented, controllable generation. A key contribution is the training strategy that uses a nearby audio clip $A_n$ from the same long video to separate action cues from persistent ambient sounds, and at inference time leverages retrieval-based conditioning to adapt to the given visual scene. The approach achieves state-of-the-art results on Ego4D-Sounds and EPIC-KITCHENS across objective metrics and human judgments, while offering action-focused generation and ambient control, with promising generalization to VR/game clips. Overall, the method broadens action-sound synthesis to in-the-wild data and provides a practical framework for realistic, controllable audio generation in immersive applications.

Abstract

Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds -- 1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.

Paper Structure (29 sections, 5 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Real-world audio consists of both foreground action sounds (whose causes are visible) and background ambient sounds generated by sources offscreen. Whereas prior generation work is agnostic to this division, our method is ambient-aware and disentangles action sound from ambient sound. Our key technical insight is how to train with in-the-wild videos exhibiting natural ambient sounds, while still learning to factor out their effects on generation. The green arrows reference how we condition generation on sound from a related, but time-distinct, video clip to achieve this.
  • Figure 2: Illustration of the harm of ambient sound in video-to-audio generation. In this example, a person is closing a packet of ginger powder, which makes a rustling sound (circled in red in the middle). A buzzing sound in the background, semantically irrelevant to the visual scene, dominates the energy of the spectrogram. On the right-hand side, we show a prediction made by a vanilla model that misses the action sound but predicts the ambient sound.
  • Figure 3: Audio condition selection and the model architecture. Left: During training, we randomly sample a neighbor audio clip as the audio condition. For inference, we query the training set audio with the (silent) input video and retrieve the audio clip that has the highest audio-visual similarity with the input video using our trained AV-Sim model (described in the pretraining section). Right: We represent audio waveforms as spectrograms and use a latent diffusion model to generate the spectrogram conditioned on both the input video and the audio condition. At test time, we use a trained vocoder network to transform the spectrogram to a waveform.
  • Figure 4: Two inference settings: "action-ambient joint generation" and "action-focused generation". In the first setting, we condition on audio retrieved from the training set and aim to generate both plausible action and ambient sounds. In the second setting, we condition on an audio file with low ambient sound and the model focuses on generating plausible action sounds while minimizing the ambient sounds.
  • Figure 5: Example clips in Ego4D-Sounds. We show one video frame, the action description, and the sound for each example. Note how these actions are subtle and long-tail, usually not present in typical video datasets.
  • ...and 5 more figures
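
The audio condition selection described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the `max_offset` window, and the use of cosine similarity over precomputed embeddings are all assumptions made for the example. The caption only states that a neighbor clip is sampled at random during training and that the training-set clip with the highest audio-visual similarity is retrieved at inference.

```python
import math
import random


def sample_neighbor_index(num_clips, target_idx, max_offset, rng):
    """Training time: pick a random neighbor clip from the same long video.

    Hypothetical helper. `max_offset` (how far away a neighbor may be, in
    clips) is an assumed parameter, not a value taken from the paper.
    """
    lo = max(0, target_idx - max_offset)
    hi = min(num_clips - 1, target_idx + max_offset)
    # The condition must be time-distinct from the target clip itself.
    candidates = [i for i in range(lo, hi + 1) if i != target_idx]
    if not candidates:
        raise ValueError("no neighbor clip available for this target")
    return rng.choice(candidates)


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_audio_index(video_emb, audio_embs):
    """Inference time: index of the training audio clip most similar to the video.

    Assumes embeddings come from some audio-visual similarity model
    (AV-Sim in the paper); cosine similarity is an assumed scoring choice.
    """
    scores = [cosine_similarity(video_emb, a) for a in audio_embs]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    # Target clip 5 in a 10-clip video, neighbors within 2 clips on either side.
    print(sample_neighbor_index(10, 5, 2, rng))
    # Toy 2-D embeddings: the second audio clip points closest to the video.
    print(retrieve_audio_index([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]))
```

In the "action-focused generation" setting from the Figure 4 caption, the retrieval step would be skipped and a fixed low-ambient audio clip would be passed as the condition instead.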