Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
TL;DR
This work tackles the problem of generating realistic action sounds from silent egocentric videos by disentangling foreground action sounds from the ambient background. It introduces AV-LDM, an ambient-aware audio-visual latent diffusion model that conditions on both the video and an ambient audio cue. During training, the cue is a nearby audio clip $A_n$ from the same long video, which lets the model separate action sounds from persistent ambient sounds; at inference time, the cue is retrieved from the training data to match the given visual scene, enabling retrieval-augmented, controllable generation. The approach achieves state-of-the-art results on Ego4D-Sounds and EPIC-KITCHENS across objective metrics and human judgments, offers action-focused generation and control over the ambient sound, and shows promise for generalizing to computer graphics game clips. Overall, the method extends action-sound synthesis to in-the-wild data and provides a practical framework for realistic, controllable audio generation in immersive applications.
Abstract
Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals, resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM, and devise an audio-conditioning mechanism that learns to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds, a set of 1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
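
The two ways of choosing the ambient audio condition can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`sample_neighbor_window`, `retrieve_ambient_clip`), the `max_offset` parameter, the non-overlap constraint, and the use of cosine similarity over precomputed clip embeddings are all assumptions made for illustration, since the abstract does not specify how the neighbor clip is sampled or how retrieval is scored.

```python
import math
import random


def sample_neighbor_window(target_start, clip_len, video_len, max_offset, rng=random):
    """Training-time condition (assumed scheme): pick the start time of a
    nearby audio window from the same long video that does not overlap the
    target clip. Returns None if no such window exists."""
    candidates = []
    # Interval of valid start times for a window that ends before the target.
    lo, hi = max(0.0, target_start - max_offset), target_start - clip_len
    if hi >= lo:
        candidates.append((lo, hi))
    # Interval of valid start times for a window that begins after the target.
    lo = target_start + clip_len
    hi = min(video_len - clip_len, target_start + max_offset)
    if hi >= lo:
        candidates.append((lo, hi))
    if not candidates:
        return None
    lo, hi = rng.choice(candidates)
    return rng.uniform(lo, hi)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_ambient_clip(query_emb, bank):
    """Inference-time condition (assumed scheme): return the id of the
    training clip whose embedding is most similar to the query video's.
    `bank` is a list of (clip_id, embedding) pairs."""
    return max(bank, key=lambda item: cosine(query_emb, item[1]))[0]
```

For example, with a hypothetical two-entry bank `[("kitchen", [1.0, 0.0]), ("street", [0.0, 1.0])]`, a query embedding of `[0.9, 0.1]` retrieves `"kitchen"`, whose audio would then serve as the ambient condition for generation.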
