Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Tae-Hyun Oh

TL;DR

Sound2Vision generates diverse images from audio by learning a cross-modal latent alignment between audio features and anchored visual features. It enriches audio representations with visual context, uses a sound source localization module to select highly correlated audio-visual moments for training, and feeds the aligned audio features into a pre-trained image generator to synthesize images. The approach outperforms previous work on VEGAS and VGGSound, offers intuitive control through waveform and latent-space manipulations, and generalizes across architectures (GANs and Latent Diffusion Models) and dataset types, including CelebV-HQ. An analysis of the geometry of the audio-visual embedding space shows that the method reduces the modality gap, which supports robust cross-modal transfer and broad applicability to cross-modal generation tasks with minimal supervision.

Abstract

How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns audio-visual modalities by enriching audio features with visual information and translating them into the visual latent space. These features are then fed into the pre-trained image generator to produce images. To enhance image quality, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. Our method achieves substantially better results on the VEGAS and VGGSound datasets compared to previous work and demonstrates control over the generation process through simple manipulations to the input waveform or latent space. Furthermore, we analyze the geometric properties of the learned embedding space and demonstrate that our learning approach effectively aligns audio-visual signals for cross-modal generation. Based on this analysis, we show that our method is agnostic to specific design choices, showing its generalizability by integrating various model architectures and different types of audio-visual data.
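
The alignment step described above can be illustrated with a small sketch. The abstract does not state the training objective, so the symmetric InfoNCE-style contrastive loss used here is an assumption, as are the function names, the temperature value, and the toy feature vectors. The sketch only shows the general idea: audio features that match their paired visual features yield a lower loss than mismatched ones.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(audio_feats, visual_feats):
    """Cosine similarity between every audio feature and every visual feature."""
    audio = [l2_normalize(v) for v in audio_feats]
    visual = [l2_normalize(v) for v in visual_feats]
    return [[sum(a * b for a, b in zip(ai, vj)) for vj in visual] for ai in audio]


def contrastive_alignment_loss(audio_feats, visual_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed objective, not taken from the paper).

    Row i of audio_feats is treated as the positive pair of row i of visual_feats;
    all other rows in the batch act as negatives.
    """
    sim = similarity_matrix(audio_feats, visual_feats)
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
            total += log_denominator - logits[i]
        return total / n

    transposed = [list(column) for column in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(transposed))


# Toy data: three "visual" anchors and audio features that are close to them.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_audio = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_audio = [aligned_audio[1], aligned_audio[2], aligned_audio[0]]

loss_aligned = contrastive_alignment_loss(aligned_audio, visual)
loss_shuffled = contrastive_alignment_loss(shuffled_audio, visual)
print(loss_aligned < loss_shuffled)
```

In this toy setup the visual features stay fixed while only the audio side varies, which mirrors the idea of anchoring on a pre-trained image encoder and moving the audio features toward it.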

Paper Structure

This paper contains 47 sections, 4 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Sound-to-image generation. We propose a model that synthesizes images of natural scenes from the sound. Our model is trained solely from paired audio-visual data, without labels or language supervision. Our model's predictions can be controlled by applying simple manipulations to the input waveforms (left), such as by mixing two sounds together or by adjusting the volume. We can also control our model's outputs in latent space, such as by interpolating in directions specified by sound (right).
  • Figure 2: Sound2Vision framework. First, the frame selection method selects the highly correlated frame-audio segment from a video for training. Then, we train Sound2Vision to produce an audio feature that aligns with the visual feature extracted from the pre-trained image encoder. In the inference stage, the extracted audio feature from input audio is fed to the image generator to produce an image.
  • Figure 3: Example comparisons between the selected top-1 frame and the mid-frame of the video.
  • Figure 4: Qualitative results from feeding a single waveform from the VGGSound test set. Sound2Vision generates diverse images across a wide variety of categories from generic input sounds.
  • Figure 5: Grad-CAM visualization for the highlighted moments in the spectrograms. In the heatmap, the regions most highlighted during image generation are colored red, transitioning to blue in less highlighted areas. The four highlighted moments denote wind blowing, elk bugling, skiing, and human talking sounds, respectively.
  • ...and 16 more figures
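
The controls mentioned in the Figure 1 caption (mixing two sounds, adjusting the volume, and interpolating in latent space) can be sketched as simple operations on lists of samples or latent vectors. This is a minimal illustration under stated assumptions: the helper names, the synthetic sine-wave inputs, the 16 kHz sample rate, and the use of plain linear interpolation are all hypothetical and are not taken from the paper.

```python
import math


def sine_wave(freq_hz, duration_s=0.01, sample_rate=16000, amplitude=0.5):
    """Synthetic stand-in for a recorded sound (hypothetical test input)."""
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * t / sample_rate)
        for t in range(n_samples)
    ]


def scale_volume(waveform, gain):
    """Adjust loudness by multiplying every sample by a gain factor."""
    return [gain * sample for sample in waveform]


def mix(wave_a, wave_b, weight_a=0.5):
    """Weighted sum of two equal-length waveforms."""
    if len(wave_a) != len(wave_b):
        raise ValueError("waveforms must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(wave_a, wave_b)]


def interpolate_latents(latent_a, latent_b, t):
    """Linear interpolation between two latent vectors (assumed scheme)."""
    return [(1.0 - t) * a + t * b for a, b in zip(latent_a, latent_b)]


low_tone = sine_wave(220.0)
high_tone = sine_wave(880.0)

mixed = mix(low_tone, high_tone, weight_a=0.5)   # input-space control
louder = scale_volume(low_tone, gain=2.0)        # input-space control
midpoint = interpolate_latents([0.0, 0.0], [1.0, 2.0], t=0.5)  # latent-space control

print(len(mixed), midpoint)
```

In a full pipeline, the manipulated waveform would be passed through the audio encoder and the resulting feature through the image generator; those components are omitted here because their interfaces are not given in this summary.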