Images that Sound: Composing Images and Sounds on a Single Canvas

Ziyang Chen, Daniel Geng, Andrew Owens

TL;DR

This work investigates generating spectrograms that simultaneously resemble natural images and sound like natural audio, by composing off-the-shelf text-to-image and text-to-spectrogram diffusion models in a shared latent space. The method denoises a latent with a multimodal noise estimate that combines both modalities, producing samples that lie at the intersection of image and spectrogram distributions and can be converted to waveforms via a vocoder. Quantitative metrics (CLIP, CLAP, FID, FAD) and human studies show the approach outperforms baselines and achieves strong audiovisual alignment, while enabling colorization for visual appeal. The results demonstrate a novel form of multimodal compositional generation with practical artistic potential, though limitations and societal considerations around steganography and model quality are acknowledged.

Abstract

Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these visual spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/

Paper Structure (48 sections, 6 equations, 11 figures, 6 tables, 1 algorithm)

This paper contains 48 sections, 6 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Images that sound. We use diffusion models to generate visual spectrograms (second row) that look like natural images, which we call images that sound. These spectrograms can be converted into natural sounds (third row) using a pretrained vocoder, or colorized to obtain more visually pleasing results (first row). Please refer to our project page to listen to the sounds.
  • Figure 2: Images vs. spectrograms. We show grayscale images generated from Stable Diffusion (Rombach et al., 2022) on the left, followed by log-mel spectrograms generated from Auffusion (Xue et al., 2024) in the middle, and our generated images that sound results on the right.
  • Figure 3: Composing audio and visual diffusion models. We generate a visual spectrogram that can be visualized as an image or played as a sound. Given a noisy latent $\mathbf{z}_t$, we apply visual and audio diffusion models, each guided by a text prompt, to compute noise estimates $\boldsymbol{\epsilon}_{v}^{(t)}$ and $\boldsymbol{\epsilon}_{a}^{(t)}$ respectively. We obtain the multimodal noise estimate $\tilde{\boldsymbol{\epsilon}}^{(t)}$ by a weighted average, then use it as part of the iterative denoising process. Finally, we decode the clean latent $\mathbf{z}_0$ to a spectrogram and convert it into a waveform using a pretrained vocoder (or by Griffin-Lim; Griffin and Lim, 1984).
  • Figure 4: Qualitative comparison. We show our qualitative results along with the imprint and SDS baselines given visual (first) and audio (second) prompts. Please zoom in for better viewing.
  • Figure 5: Qualitative examples with colorization results. We present 4 examples alongside their image prompts, audio prompts, and colorization prompts. Please refer to our project page for video results.
  • ...and 6 more figures
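
The composition step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: the two "denoisers" are hypothetical stand-ins that return the exact noise estimate for a data distribution concentrated at a single target latent, the weights `w_v` and `w_a`, the linear `alpha_bars` schedule, and the deterministic DDIM-style update are all assumptions, and the real method would call pretrained text-conditioned diffusion models, then decode the latent and run a vocoder.

```python
import numpy as np


def combine_noise(eps_v, eps_a, w_v=0.5, w_a=0.5):
    """Weighted average of the visual and audio noise estimates."""
    return (w_v * eps_v + w_a * eps_a) / (w_v + w_a)


def toy_eps(z_t, target, alpha_bar_t):
    """Hypothetical stand-in for a pretrained denoiser: the exact noise
    estimate when the data distribution is a point mass at `target`."""
    return (z_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)


def ddim_step(z_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM-style update (eta = 0) with a given noise estimate."""
    z0_pred = (z_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * z0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps


def compose_and_denoise(target_v, target_a, steps=50, w_v=0.5, w_a=0.5, seed=0):
    """Denoise one shared latent using the combined (multimodal) noise estimate."""
    rng = np.random.default_rng(seed)
    # Assumed schedule: cumulative alpha decreases as t grows (more noise).
    alpha_bars = np.linspace(0.999, 0.01, steps)
    z = rng.standard_normal(target_v.shape)  # start from pure noise
    for t in range(steps - 1, -1, -1):
        eps_v = toy_eps(z, target_v, alpha_bars[t])  # visual model's estimate
        eps_a = toy_eps(z, target_a, alpha_bars[t])  # audio model's estimate
        eps = combine_noise(eps_v, eps_a, w_v, w_a)  # multimodal estimate
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        z = ddim_step(z, eps, alpha_bars[t], alpha_bar_prev)
    return z


if __name__ == "__main__":
    image_like = np.zeros((4, 4))  # toy "image" latent
    sound_like = np.ones((4, 4))   # toy "spectrogram" latent
    z0 = compose_and_denoise(image_like, sound_like)
    print(z0.round(3))
```

In this toy setting the final latent lands at the weighted mean of the two targets, which only illustrates how a shared noise estimate pulls the sample toward a compromise between both models. With real diffusion models the outcome is a sample that both models consider plausible, not a simple average of two fixed latents.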