SonoWorld: From One Image to a 3D Audio-Visual Scene

Derong Jin, Xiyi Chen, Ming C. Lin, Ruohan Gao

Abstract

Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/

Figures (17)

  • Figure 1: From one image, SonoWorld generates an explorable 3D audio-visual scene, where you can navigate to novel views and locations while listening to spatial audio aligned with scene semantics and the 3D locations of heterogeneous sound sources.
  • Figure 2: Illustration of real-world audio-visual scene data collection and curation for SonoScene360.
  • Figure 3: Given a single image $I$, SonoWorld jointly generates a 3D visual scene $\mathbf{V}$ and a semantically and geometrically aligned audio scene $\mathbf{A}$. It consists of: 1) Visual Scene Generation (Sec. \ref{sec:scene_generation}): single-image calibration and warping followed by panorama outpainting to obtain a full $360^\circ$ panorama $I_{\mathrm{pano}}$, which is further lifted into a 3D Gaussian scene via a panorama-to-3D reconstruction model $\mathcal{G}_{\mathbf{V}}$; 2) $360^\circ$ Semantic Grounding (Sec. \ref{sec:semantic_grounding}): a VLM extracts the categories $\mathcal{C}$ of potential sound sources, which are used to generate panoramic instance masks $\mathbf{M}$ by coordinating an open-vocabulary segmentation model (OVS) with a class-agnostic segmentation model (SAM2 [ravi2024sam2]); 3) Ambisonics Encoding (Sec. \ref{sec:foa_rendering}): based on the audio prompt and equalization parameters from the VLM, a text-to-audio (T2A) model generates per-source waveforms that are equalized and mapped to ambisonics coefficients according to each source's 3D location and type; and 4) Free-Viewpoint Rendering (Sec. \ref{sec:inference}): the ambisonics coefficients are decoded into pose-dependent binaural audio $\mathbf{b}(\mathbf{p})$ using a head-related transfer function (HRTF) and synchronized with the Gaussian rendering $\mathbf{V}(\mathbf{p})$. Illustrative sketches of the warping and ambisonics steps follow this list.
  • Figure 4: Per-scene results on our SonoScene360 dataset. Results on representative metrics show that our method consistently outperforms baselines on all scenes.
  • Figure 5: User Study of Spatial Audio Quality. Preference rates across synthetic and real scenes for our method, MMAudio [mmaudio], and OmniAudio [liu2025omniaudiogeneratingspatialaudio]. Each bar shows the average per-user, per-scene preference (%); error bars indicate the interquartile range (25th--75th percentile) across 50 participants.
  • ...and 12 more figures
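
To make step 1 of Figure 3 concrete, the sketch below inverse-warps a calibrated perspective image onto an equirectangular canvas, leaving unseen pixels empty for the panorama outpainter to fill. This is a minimal illustration under stated assumptions, not the paper's implementation: the pinhole model, the +x viewing direction with z up, and nearest-neighbor sampling are our choices, and the horizontal field of view is taken as given rather than estimated by the calibration step.

    # Minimal sketch of perspective-to-equirectangular warping (assumptions:
    # pinhole camera looking along +x with z up; known horizontal FOV;
    # nearest-neighbor sampling). Not the paper's implementation.
    import numpy as np

    def warp_to_equirect(img, hfov_deg, pano_h=512, pano_w=1024):
        """Paste a perspective image onto an equirectangular canvas;
        pixels outside the camera frustum stay zero for outpainting."""
        h, w = img.shape[:2]
        f = (w / 2.0) / np.tan(np.radians(hfov_deg) / 2.0)  # focal length (px)
        # Viewing direction for every panorama pixel.
        az = (np.arange(pano_w) + 0.5) / pano_w * 2.0 * np.pi - np.pi
        el = np.pi / 2.0 - (np.arange(pano_h) + 0.5) / pano_h * np.pi
        az, el = np.meshgrid(az, el)
        dx = np.cos(el) * np.cos(az)
        dy = np.cos(el) * np.sin(az)
        dz = np.sin(el)
        # Project directions in front of the camera onto the image plane.
        front = dx > 1e-6
        u = w / 2.0 - f * dy / np.where(front, dx, 1.0)
        v = h / 2.0 - f * dz / np.where(front, dx, 1.0)
        inside = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        pano = np.zeros((pano_h, pano_w) + img.shape[2:], dtype=img.dtype)
        pano[inside] = img[v[inside].astype(int), u[inside].astype(int)]
        return pano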
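
Steps 3 and 4 of Figure 3 reduce, for a point source, to encoding a mono waveform into first-order ambisonics (FOA) from its direction relative to the listener, rotating the sound field with the listener's head pose, and decoding to two ears. The sketch below illustrates this with the ACN/SN3D convention and a crude 1/r distance gain; these conventions, the function names, and the virtual-cardioid stereo decode (standing in for the paper's HRTF-based binauralization) are assumptions for illustration, and areal and ambient sources are not modeled.

    # Minimal FOA sketch (assumptions: ACN order W, Y, Z, X with SN3D
    # normalization; 1/r gain; cardioid decode instead of HRTFs).
    import numpy as np

    def encode_foa(mono, src_pos, listener_pos):
        """Encode a mono waveform into 4-channel first-order ambisonics."""
        d = np.asarray(src_pos, float) - np.asarray(listener_pos, float)
        r = float(np.linalg.norm(d))
        az = np.arctan2(d[1], d[0])          # azimuth in the horizontal plane
        el = np.arcsin(d[2] / max(r, 1e-8))  # elevation above that plane
        s = mono / max(r, 1.0)               # crude 1/r distance attenuation
        return np.stack([s,                                # W: omnidirectional
                         s * np.sin(az) * np.cos(el),      # Y
                         s * np.sin(el),                   # Z
                         s * np.cos(az) * np.cos(el)])     # X

    def rotate_yaw(foa, yaw):
        """Rotate the sound field to account for a head turn of `yaw` about +z."""
        w, y, z, x = foa
        c, s = np.cos(yaw), np.sin(yaw)
        return np.stack([w, c * y - s * x, z, c * x + s * y])

    def decode_stereo(foa):
        """Placeholder decode: virtual cardioid microphones at +/-90 degrees."""
        w, y, _, _ = foa
        return np.stack([0.5 * (w + y), 0.5 * (w - y)])

Summing encode_foa outputs over all point sources (ambient beds would be added to the W channel), applying rotate_yaw for the current head pose, and decoding yields a pose-dependent stereo approximation of $\mathbf{b}(\mathbf{p})$ to play alongside the Gaussian rendering $\mathbf{V}(\mathbf{p})$.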