RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization

Bing Yang, Changsheng Quan, Yabo Wang, Pengyu Wang, Yujie Yang, Ying Fang, Nian Shao, Hui Bu, Xin Xu, Xiaofei Li

TL;DR

The paper addresses the gap between simulated and real acoustics in multichannel speech processing by introducing RealMAN, a relatively large-scale real-recorded dataset captured with a 32-channel microphone array in diverse indoor, outdoor, semi-outdoor and transportation scenes, with annotations for direct-path speech and source location. It describes the recording system, the data collection, and the annotation pipeline, which includes GCC-based direct-path speech estimation and LED-based speaker localization with a fisheye camera. Baseline experiments compare training on real versus simulated data for speech enhancement and source localization. They indicate that training on real data yields better real-world performance, and that real ambient noise exhibits complex, time-varying spatial correlations that are hard to emulate. The study also shows that variable-array networks trained on sub-arrays can generalize to unseen arrays, which offers a practical path toward deploying array-agnostic models in real-world settings. Overall, RealMAN serves as both a benchmark and a training resource for realistic multichannel speech processing systems.

Abstract

The training of deep learning-based multichannel speech enhancement and source localization systems relies heavily on the simulation of room impulse responses and multichannel diffuse noise, due to the lack of large-scale real-recorded datasets. However, the acoustic mismatch between simulated and real-world data can degrade model performance when the models are applied in real-world scenarios. To bridge this simulation-to-real gap, this paper presents a new, relatively large-scale Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset. The proposed dataset is valuable in two respects: 1) benchmarking speech enhancement and localization algorithms in real scenarios; 2) offering a substantial amount of real-world training data that can potentially improve the performance of real-world applications. Specifically, a 32-channel array with high-fidelity microphones is used for recording, and a loudspeaker is used to play the source speech signals (about 35 hours of Mandarin speech). A total of 83.7 hours of speech signals (about 48.3 hours for a static speaker and 35.4 hours for a moving speaker) are recorded in 32 different scenes, and 144.5 hours of background noise are recorded in 31 different scenes. Both the speech and the noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments, which enables the training of general-purpose speech enhancement and source localization networks. To obtain the task-specific annotations, the speaker location is annotated with an omni-directional fisheye camera by automatically detecting the loudspeaker. The direct-path signal, obtained by filtering the source speech signal with an estimated direct-path propagation filter, is used as the target clean speech for speech enhancement.
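
The abstract does not spell out how the direct-path propagation filter is estimated; the TL;DR only mentions a GCC-based estimate. As a rough illustration of the idea, and not the authors' actual pipeline, the sketch below assumes the direct path can be approximated by a single integer-sample delay plus a scalar gain. It estimates the delay with GCC-PHAT and the gain by least squares. The function names (`gcc_phat_delay`, `direct_path_target`) and the `max_delay` parameter are hypothetical.

```python
import numpy as np


def gcc_phat_delay(ref, rec, max_delay):
    """Estimate the integer-sample delay of `rec` relative to `ref` with GCC-PHAT."""
    n = len(ref) + len(rec)              # zero-pad to avoid circular wrap-around
    ref_spec = np.fft.rfft(ref, n=n)
    rec_spec = np.fft.rfft(rec, n=n)
    cross = rec_spec * np.conj(ref_spec)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that lags run from -max_delay to +max_delay.
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(cc)) - max_delay


def direct_path_target(src, rec, max_delay):
    """Approximate the direct-path signal as a delayed, scaled copy of `src`.

    Assumption (not from the paper): the direct-path filter is one delay
    plus one gain, so reverberation and noise in `rec` are left out.
    """
    src = np.asarray(src, dtype=float)
    rec = np.asarray(rec, dtype=float)
    d = gcc_phat_delay(src, rec, max_delay)
    shifted = np.zeros(len(src))
    if d >= 0:
        shifted[d:] = src[:len(src) - d]
    else:
        shifted[:d] = src[-d:]
    m = min(len(shifted), len(rec))
    # Least-squares gain that best matches the shifted source to the recording.
    gain = np.dot(rec[:m], shifted[:m]) / (np.dot(shifted[:m], shifted[:m]) + 1e-12)
    return gain * shifted, d, gain
```

Calling `direct_path_target(source, mic_signal, max_delay)` on a played source and one microphone channel returns the approximated target together with the estimated delay and gain. A real system would likely need sub-sample delays and a short multi-tap filter rather than a single tap.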

Paper Structure

This paper contains 30 sections, 1 equation, 10 figures, 13 tables, 3 algorithms.

Figures (10)

  • Figure 1: Recording devices.
  • Figure 2: Speech duration statistics across speakers in the RealMAN dataset. Speaker IDs beginning with 'P' denote speakers of read speech; all other IDs denote speakers of free talk. Speaker IDs are color-coded to distinguish the train, validation, and test sets.
  • Figure 3: Histogram of speaker azimuth, elevation and distance.
  • Figure 4: Four typical types of speaker movement trajectory. For each trajectory the speaker height is constant, so the trajectory is visualized only in the horizontal plane. Colors progress from lighter to darker as time advances.
  • Figure 5: Histogram of SNR for validation and test sets.
  • ...and 5 more figures