Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation
Darius Petermann, Mahdi M. Kalayeh
TL;DR
This work tackles the scarcity and restricted diversity of ground-truth audio-visual pairs by proposing image sonification: a retrieval-driven framework that pairs images from uni-modal datasets with semantically aligned audio from another pool using vision-language models. A cross-modal embedding space (via CLAP and AST) guides the retrieval and representation, enabling training of a diffusion-based audio-to-image generator that performs competitively on multiple benchmarks, even when evaluated out-of-domain. The authors show emergent auditory-inspired controls such as semantic mixing, loudness-driven weighting, and reverberation cues in generated visuals, and they provide ablations to dissect these phenomena. The approach significantly broadens the data sources and domain coverage for audio-conditioned image generation and is complemented by open-source code, weights, and a large sonified image-audio dataset of around 1 million images.
Abstract
Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on ground-truth audio-visual correspondence is not only unnecessary, but also severely restricts the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. Instead, we propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal origins are artificially paired through a retrieval process empowered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we demonstrate several intriguing auditory capabilities that our model has implicitly developed to guide the image generation process, including semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation.
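To make the retrieval-based pairing more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that image-side descriptions and audio clips have already been embedded into a shared space (for instance by a CLAP-style encoder) and pairs each image with its nearest audio clips by cosine similarity. The function names, the toy embeddings, and the choice of cosine top-k retrieval are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_audio_for_images(image_side_emb, audio_emb, k=1):
    """Hypothetical helper: for each image-side embedding, return the indices
    and cosine similarities of the k most similar audio embeddings.

    image_side_emb: (num_images, dim) array, assumed to live in the same
                    space as audio_emb (an assumption of this sketch).
    audio_emb:      (num_audio, dim) array.
    """
    queries = l2_normalize(image_side_emb)
    pool = l2_normalize(audio_emb)
    sims = queries @ pool.T                      # (num_images, num_audio)
    topk = np.argsort(-sims, axis=1)[:, :k]      # best matches first
    return topk, np.take_along_axis(sims, topk, axis=1)


# Toy data standing in for real embeddings (purely synthetic).
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(5, 8))
# Two "image" queries constructed to lie close to audio clips 3 and 1.
queries = audio_emb[[3, 1]] + 0.01 * rng.normal(size=(2, 8))

idx, scores = retrieve_audio_for_images(queries, audio_emb, k=2)
print(idx[:, 0])  # index of the best-matching audio clip for each query
```

In a real pipeline the pool would hold embeddings of a large uni-modal audio collection, and an approximate nearest-neighbor index would likely replace the dense matrix product used here.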
