Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard

TL;DR

RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques.

Abstract

We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation, which previously relied on synthetic data. In our evaluation, we thoroughly assessed existing audio and audio-visual models against multiple criteria and proposed settings to enhance their performance on real-world data. We also conducted experiments to investigate the impact of incorporating visual data (i.e., images and depth) into neural acoustic field models. Additionally, we demonstrated the effectiveness of a simple sim2real approach, where a model is pre-trained with simulated data and fine-tuned with sparse real-world data, resulting in significant improvements in the few-shot learning setting. RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques. Demos and datasets are available on our project page: https://facebookresearch.github.io/real-acoustic-fields/

Paper Structure (53 sections, 12 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Data capturing setup. (a) Audio capture (left): the loudspeaker and microphone recording system (Earful Tower) are placed at different locations within the room to measure and capture RIRs. (b) Visual capture (right): the camera rig (Eyeful Tower) moves around rooms to capture multi-view images for visual reconstruction and novel-view synthesis.
  • Figure 2: Data distribution of RAF. Blue dots represent speaker positions and red dots represent microphone positions. The room dimensions are shown on the right.
  • Figure 3: Sim2real method overview. First, we train the implicit network on simulated data with densely sampled emitter-listener position pairs. We then fine-tune it on sparse real-world data.
  • Figure 4: Visualization of generated RIRs from different methods. We visualize the ground-truth (in blue) and predicted (in red) impulse responses of several methods for qualitative comparison.
  • Figure 5: Few-shot RIR synthesis results. We evaluate model performance with different amounts of training data. Results are reported for the furnished room. Our Sim2Real method improves performance when training data is limited.
  • ...and 7 more figures
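
The two-stage sim2real recipe described in the Figure 3 caption (pre-train on dense simulated data, then fine-tune on sparse real data) can be illustrated with a toy example. The sketch below is not the paper's implicit network: it assumes a plain linear model trained by gradient descent, and every name in it (`fit`, `mse`, `w_real`, `w_sim`, the sample counts and learning rate) is a hypothetical stand-in chosen for illustration.

```python
import numpy as np


def fit(w, x, y, steps, lr):
    """Gradient descent on mean squared error for a linear model y ~ x @ w."""
    for _ in range(steps):
        grad = 2.0 * x.T @ (x @ w - y) / len(x)
        w = w - lr * grad
    return w


def mse(w, x, y):
    """Mean squared prediction error of the linear model."""
    return float(np.mean((x @ w - y) ** 2))


rng = np.random.default_rng(0)
dim = 8

# Assumption: the simulator is close to the real room, but not identical.
w_real = rng.normal(size=dim)
w_sim = w_real + 0.1 * rng.normal(size=dim)

# Dense simulated samples (stand-in for densely sampled emitter-listener pairs).
x_sim = rng.normal(size=(2000, dim))
y_sim = x_sim @ w_sim

# Sparse real measurements: fewer samples than unknowns.
x_few = rng.normal(size=(4, dim))
y_few = x_few @ w_real

# Held-out real data for evaluation.
x_test = rng.normal(size=(500, dim))
y_test = x_test @ w_real

# Stage 1: pre-train on simulated data.
w_pre = fit(np.zeros(dim), x_sim, y_sim, steps=500, lr=0.05)
# Stage 2: fine-tune the pre-trained weights on the sparse real data.
w_ft = fit(w_pre, x_few, y_few, steps=200, lr=0.05)
# Baseline: train on the sparse real data only, starting from zeros.
w_scratch = fit(np.zeros(dim), x_few, y_few, steps=200, lr=0.05)

print("pre-trained only:", mse(w_pre, x_test, y_test))
print("fine-tuned      :", mse(w_ft, x_test, y_test))
print("from scratch    :", mse(w_scratch, x_test, y_test))
```

In this toy setup the four real measurements constrain only part of the eight-dimensional parameter space, so the from-scratch baseline has nothing to fall back on for the rest, while the fine-tuned model keeps what it learned from simulation. This illustrates the motivation for the approach only and does not reproduce the paper's results.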