Visual Acoustic Fields

Yuelei Li; Hyunjin Kim; Fangneng Zhan; Ri-Zhao Qiu; Mazeyu Ji; Xiaojun Shan; Xueyan Zou; Paul Liang; Hanspeter Pfister; Xiaolong Wang

TL;DR

Visual Acoustic Fields is a 3D vision–audio framework that links visual appearance to impact sounds within a scene using 3D Gaussian Splatting (3DGS). It has two paths: a sound generation path, in which a diffusion model is conditioned on multi-level visual features, and a 3D sound localization path, which uses AudioCLIP-aligned features within a feature-augmented 3DGS. A new data-collection pipeline yields centimeter-level alignment of visuals, impact locations, and sounds across 15 real-world scenes, producing roughly 2,000 visual-sound pairs. Experiments demonstrate plausible impact sound synthesis and accurate 3D sound localization, and the work contributes a new dataset and open-source tools for multimodal reasoning in 3D spaces.

Abstract

Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.
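
To make the localization idea concrete, the following is a minimal sketch of how an audio query might be matched against per-point features in a scene. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `localize`, the use of cosine similarity, the arg-max selection, and the toy three-dimensional features are all hypothetical. In the actual framework the features would come from the feature-augmented 3DGS and an audio encoder, neither of which is modeled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def localize(audio_embedding, positions, features):
    """Return (position, score) for the point whose feature best matches the audio query.

    Assumption: each 3D point carries a feature vector living in the same
    embedding space as the audio query, so similarity is meaningful.
    """
    scores = [cosine(audio_embedding, f) for f in features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return positions[best], scores[best]


# Toy scene: three points with hand-made (hypothetical) features.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (0.2, 0.8, 0.1)]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical audio embedding that lies closest to the second point's feature.
query = [0.1, 0.9, 0.0]
best_position, best_score = localize(query, positions, features)
print(best_position, round(best_score, 4))
```

A real system would likely score all points in a batched tensor operation and return a similarity map rather than a single point; the loop here is kept only for readability.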
