Taming Visually Guided Sound Generation
Vladimir Iashin, Esa Rahtu
TL;DR
This work tackles open-domain, visually guided sound generation with a single multi-class model that generates audio faster than real time. It combines a spectrogram codebook (Spectrogram VQGAN) trained with a novel perceptual loss (LPAPS), a vision-conditioned autoregressive sampler, and a MelGAN vocoder, and it introduces Melception-based metrics for fidelity and relevance. The framework is evaluated on the VGGSound and VAS datasets, showing improved fidelity, relevance, and efficiency relative to state-of-the-art baselines, with extensive ablations and qualitative analyses. The approach targets scalable, high-quality cross-modal audio synthesis for applications such as film, music, and AI-assisted media production.
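The sampling stage summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_logits` is a hypothetical stand-in for the trained transformer, the codebook is random, and all shapes, the codebook size, and the `top_k` filter are assumptions chosen only to make the loop runnable.

```python
import numpy as np

def sample_codebook_indices(next_token_logits, video_features, num_tokens, rng, top_k=None):
    """Autoregressively sample `num_tokens` codebook indices.

    `next_token_logits(video_features, prefix)` must return a 1-D array of
    logits over the codebook; here it stands in for a trained transformer.
    """
    tokens = []
    for _ in range(num_tokens):
        logits = np.asarray(next_token_logits(video_features, tokens), dtype=np.float64)
        if top_k is not None:
            # Mask everything below the k-th largest logit (assumed heuristic).
            cutoff = np.sort(logits)[-top_k]
            logits = np.where(logits < cutoff, -np.inf, logits)
        # Numerically stable softmax over the remaining codes.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return np.array(tokens)

def toy_logits(video_features, prefix, codebook_size=16):
    """Hypothetical model: favours one code picked from the video features
    and, slightly, the previously sampled code."""
    logits = np.zeros(codebook_size)
    logits[int(video_features.sum()) % codebook_size] = 2.0
    if prefix:
        logits[prefix[-1]] += 1.0
    return logits

rng = np.random.default_rng(0)
video_features = np.ones((4, 8))       # 4 frames, 8-dim features (illustrative)
codebook = rng.normal(size=(16, 3))    # 16 codes, 3-dim each (illustrative)

indices = sample_codebook_indices(toy_logits, video_features, num_tokens=10, rng=rng, top_k=4)
latents = codebook[indices]            # a decoder would map these to a spectrogram
```

In the full system the looked-up code vectors would be decoded into a spectrogram and then passed to the vocoder; both steps are omitted here.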
Abstract
Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos, on a single GPU, in less time than it takes to play them back. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and are designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
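To make the two metric names concrete, the sketch below shows one common way such quantities are computed: a Fréchet distance between Gaussian fits of classifier features, and a mean KL divergence between per-sample class distributions. This is an illustration under assumptions, not the authors' evaluation code: the feature and probability arrays are random placeholders for Melception outputs, the function names are hypothetical, and the KL direction (real relative to generated) is assumed.

```python
import numpy as np

def frechet_distance(feat_real, feat_fake):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    # tr(sqrt(cov_r @ cov_f)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative up to numerical error.
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r) + np.trace(cov_f) - 2.0 * tr_sqrt)

def mean_kl(p_real, p_fake, eps=1e-8):
    """Mean over samples of KL(p_real || p_fake); rows are class distributions."""
    p = np.clip(p_real, eps, 1.0)
    q = np.clip(p_fake, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4))            # placeholder for classifier features
fd_same = frechet_distance(feats, feats)
fd_shift = frechet_distance(feats, feats + 1.0)

probs_a = rng.dirichlet(np.ones(5), size=50)  # placeholder class probabilities
probs_b = rng.dirichlet(np.ones(5), size=50)
kl_same = mean_kl(probs_a, probs_a)
kl_diff = mean_kl(probs_a, probs_b)
```

Identical inputs give a distance of (numerically) zero under both functions, while shifting every feature by 1 in 4 dimensions raises the Fréchet distance by the squared mean offset, which is 4.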
