Self-Supervised Audio-Visual Soundscape Stylization

Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli

TL;DR

This work introduces audio-visual soundscape stylization, the task of restyling input speech to resemble a target scene using an audio-visual conditional example. It employs a self-supervised framework based on audio-visual speech de-enhancement and a conditional latent diffusion model to transfer both acoustic properties and ambient textures from the conditioning clip, and it is trained entirely on in-the-wild video data. The approach uses latent encoders, cross-modal fusion with CLAP and CLIP, classifier-free guidance, and a two-stage processing pipeline to reconstruct high-quality waveforms with HiFi-GAN. Experimental results on CityWalk and Acoustic-AVSpeech show better objective and subjective performance than the baselines; visual conditioning provides additional gains, and the model generalizes well to non-speech sounds. The authors also note limitations and the potential for misuse in disinformation contexts.
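
As an illustration of the classifier-free guidance mentioned above, the following is a minimal sketch of the standard guidance rule, in which the conditional and unconditional noise predictions of a diffusion model are blended at sampling time. The function name `guided_noise`, the `scale` parameter, and the use of plain Python lists are assumptions made for this example; the summary does not state which formulation or guidance scale the paper uses.

```python
from typing import List, Sequence


def guided_noise(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Blend two noise predictions with the standard classifier-free guidance rule.

    Computes eps_uncond + scale * (eps_cond - eps_uncond) element-wise.
    scale = 0 ignores the condition, scale = 1 returns the conditional
    prediction unchanged, and scale > 1 pushes further toward the condition.
    (Hypothetical helper; not taken from the paper's code.)
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("noise predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

In a real sampler the two predictions would come from the same denoising network, evaluated once with the audio-visual embedding and once with that embedding dropped.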

Abstract

Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities. Please see our project webpage for video results: https://tinglok.netlify.app/files/avsoundscape/

Paper Structure (53 sections, 2 equations, 13 figures, 11 tables)

Figures (13)

  • Figure 1: Audio-visual soundscape stylization. We learn through self-supervision to manipulate input speech (middle) such that it sounds as though it were recorded within a given scene (left). Our approach captures both acoustic properties, such as reverberation, and ambient sounds, such as crashing waves (top). To help convey the results of the stylization, we have used source separation to visualize the speech waveform (shown in red) separately from background sound (shown in blue).
  • Figure 2: Soundscape stylization by conditional speech de-enhancement. We randomly select two disjoint clips from a video, designating one as a conditional example and the other as the target. We then separate and enhance the target audio. Our model's self-supervised pretext task is to remove this enhancement using the other conditional (audio, visual, or audio-visual) signal as a hint. At test time, we stylize an audio clip using a conditional example from the desired scene.
  • Figure 3: Model architecture. Given input audio derived from an enhancement model, and the conditional audio-visual clip sampled from the same video, we aim to stylize the input to closely resemble the original signal. We encode both the input and target spectrograms to the latent space using a pre-trained latent encoder, and feed them into a latent diffusion model together with the conditional audio-visual embedding. The goal is to harmonize the encoded latent of the input spectrogram with the target one. Finally, we employ a pre-trained latent decoder followed by a pre-trained HiFi-GAN vocoder to reconstruct the waveform from the latent space. Note that the latent encoder for the target spectrogram is not used at test time.
  • Figure 4: Model comparison. We show soundscape stylization results for several models, where each input audio is conditioned on two different audio-visual clips.
  • Figure 5: Qualitative generalization results. We restyle audio from LRS [son2017lip] conditioned on audio-visual (or visual-only) clips taken from AVSpeech [ephrat2018looking].
  • ...and 8 more figures
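
The clip selection described in the Figure 2 caption (two disjoint clips from one video, one used as the conditional example and one as the target) can be sketched as follows. This is only an illustrative sketch: the function name `sample_disjoint_clips`, the fixed clip length, and the uniform sampling scheme are assumptions, since the captions do not say how clip positions or durations are chosen.

```python
import random
from typing import Tuple

Clip = Tuple[float, float]  # (start_seconds, end_seconds)


def sample_disjoint_clips(
    video_len: float, clip_len: float, rng: random.Random
) -> Tuple[Clip, Clip]:
    """Return (conditional, target) clips of equal length that do not overlap.

    Hypothetical helper: equal-length clips and uniform placement are
    assumptions made for this sketch, not details from the paper.
    """
    if clip_len <= 0 or video_len < 2 * clip_len:
        raise ValueError("video must be at least twice as long as one clip")

    # Time left over once both clips are placed; distribute it as the gap
    # before the first clip and the gap between the two clips.
    slack = video_len - 2 * clip_len
    a, b = sorted(rng.uniform(0.0, slack) for _ in range(2))
    earlier = (a, a + clip_len)
    later = (b + clip_len, b + 2 * clip_len)

    # Randomize which clip plays the conditional role and which the target.
    clips = [earlier, later]
    rng.shuffle(clips)
    return clips[0], clips[1]
```

During training, the target clip would then be passed through a speech enhancement model to form the input, while the conditional clip (audio, visual, or both) serves as the hint for undoing that enhancement.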