EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos

Aashish Rai, Srinath Sridhar

TL;DR

EgoSonics tackles the challenge of generating synchronized audio for silent egocentric videos by casting audio synthesis as spectrogram generation conditioned on video embeddings. It introduces SyncroNet, a time-aware extension of ControlNet that delivers pixel-level control signals to a latent diffusion model, enabling per-frame synchronization at 30 fps and higher-frequency content up to 20 kHz. A new Video-Audio Alignment Score (VAAS) based on ViT features provides a standardized synchronization metric, and EgoSonics achieves state-of-the-art results on the Ego4D dataset, improving FID, IS, and VAAS over baselines. The approach also demonstrates downstream benefits for video summarization, highlighting practical impact in AR/VR, assistive tech, and dataset augmentation, while acknowledging limitations in occlusion scenarios and data scarcity.
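
The two figures quoted above (30 fps video, audio content up to 20 kHz) imply some simple bookkeeping, sketched below. Only the frame rate and the 20 kHz ceiling come from the summary. The 10-second clip length matches the clip length mentioned in the Figure 5 caption, while the 44.1 kHz sample rate and the helper name `frame_to_sample_range` are illustrative assumptions, not details confirmed by the paper.

```python
FPS = 30                  # video frame rate stated in the TL;DR
MAX_FREQ_HZ = 20_000      # highest audio frequency stated in the TL;DR
CLIP_SECONDS = 10         # assumed clip length (see the Figure 5 caption)
SAMPLE_RATE = 44_100      # assumed audio sample rate, not taken from the paper

# Nyquist: representing content up to f Hz needs a sample rate of at least 2f.
min_sample_rate = 2 * MAX_FREQ_HZ

num_frames = FPS * CLIP_SECONDS
samples_per_frame = SAMPLE_RATE // FPS


def frame_to_sample_range(frame_index):
    """Return the [start, end) audio sample range covered by one video frame."""
    start = frame_index * samples_per_frame
    return start, start + samples_per_frame


print(min_sample_rate, num_frames, samples_per_frame)
print(frame_to_sample_range(0), frame_to_sample_range(num_frames - 1))
```

Under these assumptions each video frame spans a fixed block of audio samples, which is the granularity at which per-frame synchronization would have to hold.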

Abstract

We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets. Existing work has been limited to domains like speech, music, or impact sounds and cannot capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strengths of latent diffusion models for conditioned audio synthesis. We first encode and process paired audio-video data to make them suitable for generation. The encoded data is then used to train a model that can generate an audio track that captures the semantics of the input video. Our proposed SyncroNet builds on top of ControlNet to provide control signals that enable the generation of temporally synchronized audio. Extensive evaluations and a comprehensive user study show that our model outperforms existing work in audio quality and on our proposed synchronization evaluation method. Furthermore, we demonstrate downstream applications of our model in improving video summarization.

Paper Structure

This paper contains 25 sections, 6 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: We present EgoSonics, a method to synthesize audio tracks conditioned on silent in-the-wild egocentric videos. Our method operates on videos at 30 fps and can synthesize audio that is semantically meaningful and synchronized with events in the video ("dish hits the sink" or "lawn mower turned off"). We also propose a new method to evaluate audio-video synchronization quality.
  • Figure 2: The overall architecture of our proposed method, EgoSonics. The input video frames are encoded by a video encoder to obtain the video embedding $E_V$. This embedding is passed to SyncroNet, which generates several control signals that guide the generation of audio spectrograms by providing pixel-level temporal control to a pre-trained Stable Diffusion (SD) model. An MLP translates the video embedding into the text embedding $c_t$ for SD. The loss between the ground-truth audio spectrogram $E_A$ and the predicted $E'_A$ is used to train SyncroNet. Finally, as a post-processing step, the generated audio is upsampled using the Audio SR module.
  • Figure 3: Training of the SyncroNet model. A trainable copy of Stable Diffusion's UNet encoder generates control signals through zero convolution layers, providing pixel-level control to the pretrained UNet decoder. The UNet decoder generates a $64 \times 64$ encoded feature map, which is passed through the VAE decoder to produce the predicted audio spectrogram $E'_A$.
  • Figure 4: Self-attention is applied to the sum of the encoded video embedding and the noisy input sample. Cross-attention is then applied between $E_V$ and the previous intermediate signal. The result is passed through a linear layer followed by a zero convolution layer to obtain the control signal $c^n$.
  • Figure 5: Comparison of how well each model synchronizes the generated audio with the ground-truth (GT) audio. Our method generates well-synchronized audio of higher quality. Diff-Foley and Make-an-Audio, in contrast, fail to do so and often fail to synthesize contextually correct audio. Diff-Foley can only synthesize 8-second audio clips, compared to 10 seconds for the other methods.
  • ...and 4 more figures
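
The captions for Figures 3 and 4 mention zero convolution layers that turn features from the trainable branch into control signals. A minimal NumPy sketch of the generic ControlNet-style idea follows: a 1x1 convolution whose weights and bias start at zero, so the control signal is exactly zero before training and the pretrained branch is initially undisturbed. This is an illustration under assumptions, not the authors' implementation. The channel count, the random feature maps, the helper name `zero_conv_1x1`, and the hand-set "trained" weights are all hypothetical; only the $64 \times 64$ spatial size is taken from the Figure 3 caption.

```python
import numpy as np


def zero_conv_1x1(features, weight, bias):
    """Apply a 1x1 convolution, i.e. a per-pixel linear map over channels.

    features: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]


rng = np.random.default_rng(0)
channels, height, width = 4, 64, 64  # channel count is assumed; 64x64 from Figure 3

# Stand-ins for intermediate features (random here, purely illustrative).
frozen_features = rng.standard_normal((channels, height, width))   # pretrained branch
control_features = rng.standard_normal((channels, height, width))  # trainable copy

# Zero initialization: the layer outputs all zeros at the start of training.
weight = np.zeros((channels, channels))
bias = np.zeros(channels)

control_signal = zero_conv_1x1(control_features, weight, bias)
combined = frozen_features + control_signal  # identical to frozen_features initially

# Hypothetical stand-in for weights after some gradient updates.
trained_weight = 0.1 * np.eye(channels)
nudged = frozen_features + zero_conv_1x1(control_features, trained_weight, bias)

print(np.array_equal(combined, frozen_features))
print(np.allclose(nudged, frozen_features))
```

The design point is that the added control path starts as a no-op, so training begins from the behavior of the pretrained model and the control signal grows only as the zero-initialized weights are updated.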