Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation

Darius Petermann, Mahdi M. Kalayeh

TL;DR

This work tackles the scarcity and restricted diversity of ground-truth audio-visual pairs by proposing image sonification: a retrieval-driven framework that pairs images from uni-modal datasets with semantically aligned audio from another pool using vision-language models. A cross-modal embedding space (via CLAP and AST) guides the retrieval and representation, enabling training of a diffusion-based audio-to-image generator that performs competitively on multiple benchmarks, even when evaluated out-of-domain. The authors show emergent auditory-inspired controls such as semantic mixing, loudness-driven weighting, and reverberation cues in generated visuals, and they provide ablations to dissect these phenomena. The approach significantly broadens the data sources and domain coverage for audio-conditioned image generation and is complemented by open-source code, weights, and a large sonified image-audio dataset of around 1 million images.

Abstract

Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on ground-truth audio-visual correspondence is not only unnecessary, but also severely restricts the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. Instead, we propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal origins are artificially paired through a retrieval process empowered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities that our model has implicitly developed to guide the image generation process, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation.
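The retrieval-based pairing described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each image has already been turned into a query vector (for example, by embedding a VLM-generated sound description with a text encoder that shares an embedding space with the audio encoder) and that the audio pool has been embedded into the same space. The function names, the `top_k` parameter, and the use of plain cosine similarity are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_audio_for_images(query_emb, audio_emb, top_k=1):
    """Hypothetical helper: pair each image query with its nearest audio clips.

    query_emb: (num_images, dim) array, one query vector per image (assumed
               to live in the same space as audio_emb).
    audio_emb: (num_audio, dim) array, one vector per clip in the audio pool.
    Returns (indices, scores), each of shape (num_images, top_k), sorted by
    decreasing cosine similarity.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    sims = q @ a.T  # (num_images, num_audio) cosine similarities
    order = np.argsort(-sims, axis=1)[:, :top_k]
    scores = np.take_along_axis(sims, order, axis=1)
    return order, scores
```

Because the image pool and the audio pool only meet through this similarity search, either side can be swapped for a larger or higher-quality uni-modal dataset without re-collecting paired video data.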

Paper Structure

This paper contains 26 sections, 1 equation, 16 figures, 1 table, 1 algorithm.

Figures (16)

  • Figure 1: Our audio-visual data modeling demonstrates versatile control for image generation through audio manipulations, including loudness calibration, audio mixing, and reverberations, showcasing our model's adaptability across a wide and unconstrained data domain.
  • Figure 2: Visually-aligned (top) vs. sonically-aligned (bottom) image descriptions using LLaVA [liu2023llava] and CogVLM [wang2023cogvlm]. Each description is obtained via its respective prompt: "Provide a short and concise description of the following image." and "As a numbered list, provide one to up to three sound(s) associated with prominent objects visible and present in the image. Provide the objects followed by their associated sound". A large portion of the comprehensive description (red) does not pertain to acoustic properties, whereas only a few keywords do (green). Through handcrafted prompting, we obtain sonically-aligned and acoustically relevant descriptors.
  • Figure 3: Qualitative comparison of state-of-the-art audio-to-image generative models on four different datasets: Greatest Hits (top left), VEGAS (top right), Landscapes + ITW (bottom left), and VGGSound (bottom right). Models are highlighted in green if they are evaluated on in-sample data and in red if evaluated on out-of-sample data. Our model consistently performs well on out-of-sample data, with results that are on par with, and most often exceed, those of in-sample models.
  • Figure 4: Examples of various semantics mixed together in the audio domain and their generated visual counterparts.
  • Figure 5: Impact of loudness variation for individual audio sources in the mix. From left to right, we increase the loudness of one source while keeping the other's fixed.
  • ...and 11 more figures
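
The audio manipulations behind Figures 4 and 5 (mixing two sources, then raising the loudness of one while the other stays fixed) can be sketched as follows. This is a hedged illustration only: the source does not specify how the mixes were produced, so the decibel-gain convention, the peak normalization step, and the helper names `db_to_gain` and `mix_with_gain` are assumptions.

```python
import numpy as np


def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def mix_with_gain(source_a, source_b, gain_db_b=0.0):
    """Hypothetical helper: sum two mono waveforms, scaling only source_b.

    source_a is kept at its original level; source_b is scaled by gain_db_b.
    The longer signal is truncated to the length of the shorter one, and the
    mix is peak-normalized only if it would exceed the [-1, 1] range.
    """
    a = np.asarray(source_a, dtype=np.float64)
    b = np.asarray(source_b, dtype=np.float64)
    n = min(len(a), len(b))
    mix = a[:n] + db_to_gain(gain_db_b) * b[:n]
    peak = np.max(np.abs(mix)) if n > 0 else 0.0
    if peak > 1.0:
        mix = mix / peak  # avoid clipping while preserving the source ratio
    return mix


# Example sweep: the same two sources mixed at increasing levels for source B.
sample_rate = 16000  # assumed sample rate for this illustration
t = np.arange(sample_rate) / sample_rate
tone_a = 0.1 * np.sin(2 * np.pi * 220.0 * t)
tone_b = 0.1 * np.sin(2 * np.pi * 440.0 * t)
sweep = [mix_with_gain(tone_a, tone_b, gain_db_b=g) for g in (-12.0, 0.0, 12.0)]
```

Each mix in `sweep` would then be fed to the audio encoder as a separate conditioning input, giving one generated image per loudness setting.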