Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration

Siyi Xie, Hanxin Zhu, Tianyu He, Xin Li, Zhibo Chen

TL;DR

Sonic4D is a training-free framework that generates spatial audio for immersive exploration of 4D scenes. The generated audio is realistic and consistent with the synthesized 4D scene, significantly enhancing the immersive experience for users.

Abstract

Recent advances in 4D generation have shown a remarkable capability for synthesizing photorealistic renderings of dynamic 3D scenes. However, despite this impressive visual performance, almost all existing methods overlook the generation of spatial audio aligned with the corresponding 4D scenes, which significantly limits truly immersive audiovisual experiences. To address this limitation, we propose Sonic4D, a novel framework that enables spatial audio generation for immersive exploration of 4D scenes. Specifically, our method is composed of three stages: 1) To capture both the dynamic visual content and the raw auditory information from a monocular video, we first employ pre-trained expert models to generate the 4D scene and its corresponding monaural audio. 2) To transform the monaural audio into spatial audio, we then localize and track the sound sources within the 4D scene, estimating their 3D spatial coordinates at different timestamps via a pixel-level visual grounding strategy. 3) Based on the estimated sound source locations, we synthesize plausible spatial audio that varies across viewpoints and timestamps using physics-based simulation. Extensive experiments demonstrate that our method generates realistic spatial audio consistent with the synthesized 4D scene in a training-free manner, significantly enhancing the immersive experience for users. Generated audio and video examples are available at https://x-drunker.github.io/Sonic4D-project-page.
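
To make the second stage more concrete, the following is a minimal sketch of lifting 2D grounding pixels to 3D with depth and then discarding outliers by density clustering. It is not the authors' implementation. It assumes a pinhole camera with known intrinsics (fx, fy, cx, cy) and metric depth per pixel, and it assumes that clustering is used to reject stray points before averaging; the abstract specifies neither. All function names, parameter values, and sample pixels are hypothetical, and the clustering routine is a small pure-Python DBSCAN written only for illustration.

```python
import math
from collections import deque


def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def dbscan(points, eps, min_pts):
    """Minimal O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # A point counts as its own neighbor, as in the standard definition.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_nbrs)
    return labels


def largest_cluster_centroid(points, eps=0.2, min_pts=3):
    """Centroid of the most populated cluster, or None if every point is noise."""
    labels = dbscan(points, eps, min_pts)
    kept = [label for label in labels if label != -1]
    if not kept:
        return None
    best = max(set(kept), key=kept.count)
    members = [p for p, label in zip(points, labels) if label == best]
    return tuple(sum(axis) / len(members) for axis in zip(*members))


# Hypothetical grounding pixels (u, v, depth in meters) for one frame;
# the last entry imitates a grounding or depth error.
pixels = [(320, 240, 2.0), (330, 240, 2.0), (320, 250, 2.0),
          (310, 240, 2.0), (600, 240, 8.0)]
points = [back_project(u, v, d, 500.0, 500.0, 320.0, 240.0) for u, v, d in pixels]
source_position = largest_cluster_centroid(points)
```

Running such a step once per frame would yield one 3D position per timestamp. The eps and min_pts values here are arbitrary and would need tuning to the scale of the scene.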

Paper Structure

This paper contains 18 sections, 12 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Pipeline of Sonic4D. Our method is composed of three stages: 1) Dynamic Scene and Monaural Audio Generation: extracting semantically aligned visual scenes and audio priors from monocular videos; 2) 3D Sound-Source Localization and Tracking: estimating the sound source's trajectory in 3D space for physically accurate sound propagation modeling; 3) Physics-Driven Spatial Audio Synthesis: using a physics-based room impulse response simulation to render spatial audio.
  • Figure 2: Illustration of Stage II: We localize the sound source in each frame using GroundingGPT (Li et al., 2024), back-project the 2D grounding results to 3D via depth, and apply DBSCAN (Ester et al., 1996) to obtain a smooth trajectory.
  • Figure 3: Qualitative results across different scenarios. We compare the spatial audio generated by Sonic4D when conditioned on various camera trajectories, including a static camera viewpoint, a camera circling around the subject, rightward panning, and pulling out. These examples demonstrate the temporal and spatial alignment between the generated spatial audio and the motion of the visual subject.
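
As a rough illustration of why the rendered audio changes with the listener's viewpoint, the sketch below delays and attenuates a mono signal separately for each ear according to its distance from the source. This is a deliberately simplified free-field, direct-path-only model, not the room impulse response simulation named in the Figure 1 caption: it ignores reflections, head shadowing, and source motion. The function names, ear positions, and sample rate are assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second, approximate value for air near 20 degrees C


def direct_path(source, ear, sample_rate):
    """Return (delay_in_samples, gain) for the straight path from source to ear."""
    distance = max(math.dist(source, ear), 1e-3)  # clamp to avoid division by zero
    delay = round(distance / SPEED_OF_SOUND * sample_rate)
    gain = 1.0 / distance  # inverse-distance amplitude attenuation
    return delay, gain


def render_binaural(mono, source, left_ear, right_ear, sample_rate):
    """Delay and scale a mono signal per ear; returns (left, right) sample lists."""
    channels = []
    for ear in (left_ear, right_ear):
        delay, gain = direct_path(source, ear, sample_rate)
        out = [0.0] * (len(mono) + delay)
        for n, sample in enumerate(mono):
            out[n + delay] = gain * sample
        channels.append(out)
    # Zero-pad the shorter channel so both have the same length.
    length = max(len(channel) for channel in channels)
    left, right = [channel + [0.0] * (length - len(channel)) for channel in channels]
    return left, right


# Hypothetical setup: the source sits 1 m to the listener's right,
# with the ears 0.2 m apart on the x-axis.
impulse = [1.0]
left, right = render_binaural(
    impulse,
    source=(1.0, 0.0, 0.0),
    left_ear=(-0.1, 0.0, 0.0),
    right_ear=(0.1, 0.0, 0.0),
    sample_rate=48000,
)
```

In this setup the impulse reaches the right ear earlier and louder than the left, which is the basic cue for a source on the right. Moving the ear positions along a camera trajectory and re-rendering would change both cues over time.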