Taming Visually Guided Sound Generation

Vladimir Iashin, Esa Rahtu

TL;DR

This work tackles open-domain, visually guided sound generation by introducing a single multi-class model that surpasses real-time playback speed. It combines a perceptually rich spectrogram codebook (Spectrogram VQGAN) with a vision-conditioned autoregressive sampler and a MelGAN vocoder, guided by a novel LPAPS perceptual loss and Melception-based fidelity/relevance metrics. The framework is evaluated on large-scale open-domain datasets (VGGSound and VAS), showing improved fidelity, relevance, and efficiency relative to state-of-the-art baselines, with extensive ablations and qualitative analyses. The approach enables scalable, high-quality cross-modal audio synthesis suitable for practical applications in film, music, and AI-assisted media production.

Abstract

Recent advances in visually induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos on a single GPU, in less time than it takes to play them back. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and are designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state of the art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
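
The sampling step described in the abstract (a transformer that picks codebook indices one at a time, conditioned on video features) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_next_logits` is a hypothetical stand-in for the trained transformer, the codebook size of 4 and the feature values are made up, and the later decoding and vocoding stages are left out. It only shows the shape of the autoregressive loop.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_indices(
    next_logits: Callable[[list, List[int]], List[float]],
    visual_features: list,
    num_tokens: int,
    rng: random.Random,
) -> List[int]:
    """Autoregressively sample `num_tokens` codebook indices.

    Each step conditions on the visual features and on every index
    generated so far, then draws the next index from the softmax.
    """
    generated: List[int] = []
    for _ in range(num_tokens):
        probs = softmax(next_logits(visual_features, generated))
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        generated.append(token)
    return generated


def toy_next_logits(visual_features: list, prefix: List[int]) -> List[float]:
    """Hypothetical stand-in for the transformer (assumption, not the paper's model).

    It favors one index that depends on the number of feature vectors
    and on how many indices have been generated so far.
    """
    codebook_size = 4  # made-up size for illustration
    preferred = (len(visual_features) + len(prefix)) % codebook_size
    return [5.0 if k == preferred else 0.0 for k in range(codebook_size)]


# Made-up "video features": three frames, two values each.
toy_features = [[0.2, 0.1], [0.4, 0.3], [0.5, 0.9]]
toy_tokens = sample_indices(
    toy_next_logits, toy_features, num_tokens=6, rng=random.Random(0)
)
print(toy_tokens)
```

In the pipeline the abstract describes, the sampled indices would then be looked up in the codebook, decoded into a spectrogram, and converted to a waveform by the vocoder.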

Paper Structure

This paper contains 51 sections, 4 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: A single model supports the generation of visually guided, high-fidelity sounds for multiple classes from an open-domain dataset in less time than it takes to play them.
  • Figure 2: Vision-based Conditional Cross-modal Autoregressive Sampler. A transformer autoregressively samples the next codebook index given a sequence of visual features along with previously generated codebook indices. Once sampling is done, a sequence of generated indices is used to look up a pretrained codebook. Next, a pretrained codebook decoder is used to decode a spectrogram from a codebook representation. Finally, the generated spectrogram is turned into a waveform using a pretrained general-purpose spectrogram vocoder.
  • Figure 3: Training Perceptually-Rich Spectrogram Codebook. A spectrogram is passed through a 2D codebook encoder that effectively shrinks the spectrogram. Next, each element of a small-scale encoded representation is mapped to its closest neighbor from the codebook. A 2D codebook decoder is then used to reconstruct the input spectrogram. The training of the model is guided by codebook, reconstruction, adversarial, and LPAPS losses.
  • Figure 4: Samples produced by the conditional cross-modal sampler are relevant and have high fidelity. The top row shows results of a model trained on VGGSound to sample from a VGGSound codebook ("from VGGSound for VGGSound"), the middle row is "from VGGSound for VAS", and the bottom row is "from VAS to VAS". An "opinion" of Melception is on the right.
  • Figure 5: "The great drum solo". The sample is generated by the model trained on $\sim$10-second spectrograms given 5 RGB and optical flow video frames. Generation takes less time than playing the sample.
  • ...and 15 more figures
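
The quantization step in the Figure 3 caption (each encoded element is mapped to its closest neighbor from the codebook) can be sketched as a nearest-neighbor lookup. The sketch below is an illustration under stated assumptions, not the paper's code: it uses squared Euclidean distance, plain Python lists instead of tensors, and made-up toy vectors; the names `quantize` and `squared_distance` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    encoded: List[Vector], codebook: List[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each encoded vector to its closest codebook entry.

    Returns the chosen codebook indices and the matching codebook vectors.
    """
    indices: List[int] = []
    for z in encoded:
        # Index of the codebook entry with the smallest distance to z.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        indices.append(best)
    quantized = [codebook[k] for k in indices]
    return indices, quantized


# Toy data (hypothetical values, not taken from the paper).
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
toy_encoded = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7), (0.6, 0.6)]

toy_indices, toy_quantized = quantize(toy_encoded, toy_codebook)
print(toy_indices)  # -> [0, 1, 2, 1]
```

The index sequence is the compact representation: per the Figure 2 caption, the sampler generates such indices, and the codebook decoder turns the looked-up vectors back into a spectrogram.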