MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao

TL;DR

MRSAudio addresses the scarcity of large-scale, richly annotated spatial audio datasets by introducing a 484-hour multimodal corpus spanning four real-world subsets (MRSLife, MRSSpeech, MRSSing, MRSMusic) with synchronized binaural and first-order ambisonic (FOA) audio, video, 3D pose data, transcripts, lyrics, scores, and motion prompts. The authors define five benchmark tasks (audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection) and provide evaluation protocols and strong baselines, notably diffusion-based and end-to-end spatial models. Experiments across these tasks indicate that MRSAudio supports high-quality spatial modeling and cross-modal generation, with reported improvements over traditional baselines in spatial fidelity, intelligibility, and pitch accuracy. The dataset is aimed at immersive AR/VR applications, perceptual scene analysis, and controllable spatial audio synthesis. The authors acknowledge limitations in how the visual modality is used and in the scale of the FOA recordings, which they leave to future work, and they discuss community impact considerations.

Abstract

Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.

Paper Structure

This paper contains 39 sections, 1 equation, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Overview of MRSAudio. The dataset comprises four real-world scenarios: MRSSpeech, MRSLife, MRSMusic, and MRSSing, each with multimodal annotations for spatial audio research.
  • Figure 2: The pipeline of data collection and processing of MRSAudio. The blue boxes indicate steps requiring manual intervention, while the green boxes denote automated processing. In the “auto-processing” section, dashed modules apply to some scenarios, while solid modules apply to all.
  • Figure 3: Statistical overview of MRSAudio. (a) Spatial distribution of sound sources relative to the listener. Red, green, and blue arrows denote the positive x, y, and z axes; azimuth is measured around the z-axis from the x-axis, and elevation is relative to the xy-plane. (b) Word cloud. (c) Proportions of recording spaces by room size. (d) Distribution of audio segment durations.
  • Figure 4: Statistical overview of MRSDialogue. (a) Spatial distribution of sound sources relative to the listener. Red, green, and blue arrows denote the positive x, y, and z axes; azimuth is measured around the z-axis from the x-axis, and elevation is relative to the xy-plane. (b) Distribution of phones in MRSDialogue.
  • Figure 5: Statistical overview of MRSSound. (a) Spatial distribution of sound sources relative to the listener. Red, green, and blue arrows denote the positive x, y, and z axes; azimuth is measured around the z-axis from the x-axis, and elevation is relative to the xy-plane. (b) Distribution of sound event durations in MRSSound.
  • ...and 4 more figures
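
The figure captions above describe the angular convention for source positions: azimuth is measured around the z-axis from the x-axis, and elevation is measured relative to the xy-plane. A minimal sketch of that conversion from listener-centered Cartesian coordinates follows. The function name, the use of degrees, the sign convention (azimuth positive toward +y), and the output range are assumptions made for illustration; the captions do not specify them, and this is not code from the dataset's tooling.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Convert a listener-centered position to (azimuth, elevation, distance).

    Assumed conventions (not stated in the captions):
      - angles are returned in degrees
      - azimuth is positive when rotating from +x toward +y, in (-180, 180]
      - elevation is positive toward +z, in [-90, 90]
    """
    distance = math.sqrt(x * x + y * y + z * z)
    if distance == 0.0:
        raise ValueError("direction is undefined at the listener position")

    # Azimuth: angle around the z-axis, measured from the x-axis.
    azimuth = math.degrees(math.atan2(y, x))

    # Elevation: angle between the source direction and the xy-plane.
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))

    return azimuth, elevation, distance


if __name__ == "__main__":
    # A source one unit along +y, level with the listener.
    print(cartesian_to_spherical(0.0, 1.0, 0.0))
```

Using `atan2` for both angles avoids the domain errors that `asin(z / distance)` can raise when rounding pushes the ratio slightly outside [-1, 1].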