Visual Acoustic Fields

Yuelei Li, Hyunjin Kim, Fangneng Zhan, Ri-Zhao Qiu, Mazeyu Ji, Xiaojun Shan, Xueyan Zou, Paul Liang, Hanspeter Pfister, Xiaolong Wang

TL;DR

Visual Acoustic Fields is a 3D vision–audio framework that links visual appearance to impact sounds within a scene represented by 3D Gaussian Splatting (3DGS). It pairs a sound generation path, in which a diffusion model is conditioned on multi-level visual features, with a 3D sound localization path that uses AudioCLIP-aligned features within a feature-augmented 3DGS. A novel data-collection pipeline yields centimeter-level alignment of visuals, impact locations, and sounds across 15 real-world scenes, producing roughly 2,000 visual-sound pairs. Experiments demonstrate plausible impact sound synthesis and accurate 3D sound localization, establishing a new dataset and open-source tools for multimodal reasoning in 3D spaces.

Abstract

Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of Visual Acoustic Fields, a novel framework for integrating visual and auditory signals within a 3D scene. Our approach leverages 3D Gaussian Splatting (3DGS) to represent the scene and associates it with impact sounds. The framework enables two key tasks: vision-conditioned sound generation, where impact sounds are synthesized based on impact location, and sound localization, where the model identifies the source of a given sound within the 3D environment.
  • Figure 2: Pipeline for data collection. A novel re-rendering strategy is proposed to enable accurate annotation of impact sounds and their locations without introducing artifacts. 1) We capture two sets of multiview images: $I$ of the scene, and $I^h$, which contain visible hitting markers and are synchronized with the corresponding hitting sounds. 2) Using Structure-from-Motion (SfM), we jointly estimate the camera poses of $I$ and $I^h$, denoted $P$ and $P^h$, respectively. The impact locations are obtained by detecting the markers with OWL-v2 (Minderer et al., 2023) and back-projecting the detections to 3D using the known camera poses $P^h$ and depth maps. 3) A 3DGS is trained on the multiview images $I$ and camera poses $P$. The images with impact locations are then re-rendered without markers from the 3DGS at camera poses $P^h$, yielding clean images $\underline{I^h}$ paired with hitting sounds and hitting positions.
  • Figure 3: Overview of the Visual Acoustic Fields framework. The model consists of two main components: sound generation and sound localization. Given multiview images, a feature-augmented 3D Gaussian Splatting (feature 3DGS) representation is constructed. For sound generation, localized multi-level features queried from the feature 3DGS are used as conditions to fine-tune a pretrained Stable Audio diffusion model to synthesize impact sounds. For sound localization, a fine-tuned AudioCLIP encoder maps input audio queries to the feature 3DGS, allowing the model to localize the corresponding impact location by computing feature similarity. Trainable, frozen, and fine-tuned components are indicated in the diagram.
  • Figure 4: Example scenes in our dataset. Our dataset consists of 15 diverse environments, including indoor and outdoor settings such as a bedroom, kitchen, bathroom, office, library, coffee corner, and garden. Each scene contains various materials (e.g., wood, metal, plastic, ceramic) and impact locations, yielding a rich collection of visual-audio pairs for training and evaluation.
  • Figure 5: Visualization of sound localization results. Given an input hitting sound, our model predicts the most relevant impact location within the 3D scene. (a) The heatmap represents the localization confidence scores, where brighter regions indicate higher confidence for the predicted sound source. (b) The highlighted region denotes the final localized impact objects (or parts).
  • ...and 1 more figure
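
Two steps mentioned in the captions above can be illustrated with a minimal sketch: lifting a detected 2D marker to a 3D impact location (Figure 2, step 2) and localizing a sound by feature similarity (Figures 3 and 5). This is not the authors' implementation. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a depth value measured along the camera z-axis, and a camera-to-world pose `(R, t)`. The function names, the pose convention, and the use of cosine similarity over a per-pixel feature map are all illustrative assumptions.

```python
import math


def backproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with z-depth `depth` to a 3D world point.

    Assumption: pinhole camera, camera-to-world pose X_world = R * X_cam + t,
    with R a 3x3 nested list and t a length-3 sequence.
    """
    # Point in camera coordinates: scale the pixel ray so z equals the depth.
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rigid transform from camera coordinates to world coordinates.
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def localize(audio_embedding, feature_map):
    """Score every pixel feature against an audio embedding.

    `feature_map` is a hypothetical H x W grid (nested lists) of feature
    vectors, standing in for features rendered from a feature 3DGS.
    Returns the similarity heatmap and the (row, col) of its maximum.
    """
    heatmap = [
        [cosine_similarity(audio_embedding, feat) for feat in row]
        for row in feature_map
    ]
    best = max(
        ((r, c) for r in range(len(heatmap)) for c in range(len(heatmap[r]))),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    return heatmap, best
```

In this sketch, the pixel returned by `localize` could be passed to `backproject_pixel` together with a rendered depth value and the view's camera pose to obtain a 3D impact estimate. Whether the actual system thresholds the heatmap, aggregates over several views, or scores 3D Gaussians directly is not stated in the text above.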