Hearing Anything Anywhere

Mason Wang; Ryosuke Sawata; Samuel Clarke; Ruohan Gao; Shangzhe Wu; Jiajun Wu

TL;DR

Hearing Anything Anywhere introduces DiffRIR, a differentiable room impulse response (RIR) renderer that decomposes room acoustics into interpretable components for the source (localization, directivity, and impulse response) and the surfaces (reflectivity and per-surface responses), then sums the per-path contributions and a learned residual. Using only ~12 RIR measurements and a planar room representation, DiffRIR estimates RIRs and music at novel listener locations, delivering monaural and binaural renderings with improved accuracy over baselines in real rooms. The authors collect a dedicated DiffRIR dataset across four diverse environments, validate with extensive experiments, and demonstrate interpretable parameters (directivity heatmaps and reflection coefficients) that enable virtual scene edits such as speaker rotation/translation and panel relocation. The framework is data-efficient, supports binauralization via HRIRs, and provides practical insights for robotics and architectural acoustics, with code and data released for reproducibility.

Abstract

Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene, a setup that is easily achievable by ordinary users. To this end, we introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method, we collect a dataset of RIR recordings and music in four diverse, real environments. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations, and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene.
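As a minimal illustration of how an estimated RIR can be applied to arbitrary source audio, the sketch below convolves a dry signal with an RIR in plain Python. This is the standard linear-acoustics operation and is not taken from the paper's code; the function name `render_audio` and the toy signals are hypothetical.

```python
def render_audio(source_audio, rir):
    """Convolve dry source audio with an RIR (hypothetical helper).

    The result approximates what a listener would hear at the location
    where the RIR was measured or predicted.
    """
    out = [0.0] * (len(source_audio) + len(rir) - 1)
    for i, s in enumerate(source_audio):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


# Toy RIR (made-up values): direct sound at sample 0 and a single
# half-amplitude echo three samples later.
toy_rir = [1.0, 0.0, 0.0, 0.5]
dry = [1.0, 2.0]
wet = render_audio(dry, toy_rir)
```

A binaural rendering would repeat the same operation once per ear with a left and a right impulse response. Real signals are long enough that an FFT-based convolution is the practical choice.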

Paper Structure (82 sections, 8 equations, 15 figures, 15 tables)

Introduction
Related Work
Learning-Based Room Acoustics Prediction.
Audio-Visual (AV) Room Acoustics Prediction.
Geometry-Based RIR Simulation.
Differentiable Acoustics.
Method
Task Formulation
The DiffRIR Framework
Characterizing the Sound Source
Source Localization.
Source Directivity.
Source Impulse Response.
Modeling and Characterizing Reflections
Reflectivity.
...and 67 more sections

Figures (15)

Figure 1: Differentiable Room Impulse Response Rendering Framework (DiffRIR). Our model renders the contribution to the RIR of a single traced reflection path. After computing a reflection path, we characterize it by the direction at which it exits the speaker, its length, and the surfaces on which it reflects. The sound source has a learned frequency response that depends on the outgoing direction, and each surface has a different learned frequency response. We multiply each of these responses to estimate the overall path response. To determine the reflection path's time-domain contribution to the final RIR, we apply a minimum-phase inverse-Fourier transform to the path response, convolve it with the source impulse response, and then shift the result in time based on the path length and the speed of sound.
Figure 3: Visualization of our model's learned parameters. The left images show sample spherical heatmaps that our model fits to the speaker's directivity pattern when trained on 12 points from the Classroom subdataset. The green dot indicates the direction the speaker is facing, and the yellow regions indicate higher volume. The right image shows reflection amplitude responses that our model learns for various surfaces.
Figure 4: RIR loudness heatmaps generated from DiffRIR trained on 12 points in the Dampened Room's base subdataset.
Figure 5: Visualization of RIR loudness maps generated from our model trained in each of the four base subdatasets. We measure loudness by rendering an RIR at a given listener location and measuring its RMS volume level. For each RIR rendered, we fix the height of the listener location to be 1 meter above the floor. The resolution of each xy-grid is approximately 5 centimeters in both the x and y directions. We fix the location and orientation of the speaker (indicated by the black icon) to where it was during RIR measurement. The color scale is in decibels and is consistent between rooms. The green dots indicate the xy locations of the 12 training points, which are projected onto the $z=1$ plane.
Figure 6: Visualization of RIR loudness at 70 Hz in the Classroom subdataset. The sound field intensity at a given location is measured by filtering the ground-truth or predicted RIR around 70 Hz using a second-order Butterworth filter (Butterworth, 1930) and measuring the RMS volume level of the filtered signal. Subfigure a) shows the intensity of the 70 Hz sound field at all locations in the subdataset. Subfigure b) shows predicted intensities at these same locations using our model trained on 12 points. We indicate the spatial locations of these 12 training points with green dots, and the speaker's location and orientation with a black icon. Subfigures c) through g) show the sound field intensity as predicted by each of our baseline models. Note that in subfigure d), the Linear baseline underestimates the sound field intensity at locations far away from the training locations, since the linear interpolation at these locations is a weighted average of roughly uncorrelated signals whose mean is roughly zero.
...and 10 more figures
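
The rendering steps described in the Figure 1 caption can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the responses are hand-picked rather than learned, the minimum-phase step uses a textbook real-cepstrum construction, and effects the caption does not mention (such as distance attenuation or the learned residual) are left out.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; fine for a tiny demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def minimum_phase_ir(half_mag):
    """Minimum-phase impulse response matching magnitudes for bins 0..N/2."""
    n = 2 * (len(half_mag) - 1)
    full = list(half_mag) + list(half_mag[-2:0:-1])  # mirror to N bins
    log_mag = [math.log(max(m, 1e-8)) for m in full]
    cep = [v.real for v in dft(log_mag, inverse=True)]  # real cepstrum
    # Fold the anti-causal half of the cepstrum onto the causal half.
    folded = ([cep[0]] + [2 * c for c in cep[1:n // 2]]
              + [cep[n // 2]] + [0.0] * (n // 2 - 1))
    spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real for v in dft(spectrum, inverse=True)]


def convolve(a, b):
    """Direct linear convolution of two short sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def render_path(source_mag, surface_mags, source_ir, path_length_m,
                sample_rate=1000, speed_of_sound=343.0, rir_len=128):
    """Time-domain contribution of one reflection path (toy parameters)."""
    # Multiply the source response in the outgoing direction by the
    # response of every surface the path reflects on.
    path_mag = list(source_mag)
    for mag in surface_mags:
        path_mag = [p * m for p, m in zip(path_mag, mag)]
    # Minimum-phase inverse transform, then convolve with the source IR.
    filtered = convolve(minimum_phase_ir(path_mag), source_ir)
    # Shift in time according to path length and the speed of sound.
    delay = round(path_length_m / speed_of_sound * sample_rate)
    rir = [0.0] * rir_len
    for i, v in enumerate(filtered):
        if delay + i < rir_len:
            rir[delay + i] += v
    return rir


# Made-up example: flat source response, one wall that absorbs more at
# high frequencies, a unit-impulse source IR, and a 3.43 m path.
source_mag = [1.0] * 9
wall_mag = [0.8 - 0.05 * k for k in range(9)]
contribution = render_path(source_mag, [wall_mag], [1.0], path_length_m=3.43)
```

With the toy sample rate of 1000 Hz, the 3.43 m path corresponds to a delay of 10 samples, so the contribution is zero before that index. A full RIR would sum such contributions over all traced paths.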
