DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation
Chunxi Wang, Maoshen Jia, Wenyu Jin
TL;DR
DARAS tackles blind room impulse response (RIR) estimation from monaural reverberant speech by jointly extracting audio and room acoustic features, fusing them with a hybrid-path cross-attention mechanism, and synthesizing RIRs via a dynamic early/late segmentation decoder. The Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module enables efficient blind estimation of $\mathcal{V}$, RT$_{60}$, and $\mathcal{B}_{\mathrm{p}}$ using a Mamba state space model and self-supervised pretraining, while the dynamic acoustic tuning (DAT) decoder adapts early reflections and late reverberation based on the estimated room parameters. Experiments on seven real datasets plus virtual rooms show state-of-the-art performance in both blind room parameter estimation (BRPE) and blind RIR estimation, with gains on objective metrics ($\mathcal{L}_{\mathrm{STFT}}$, RT$_{60}$, and the direct-to-reverberant ratio, DRR) and in MUSHRA listening tests. The approach reduces the data collection burden, generalizes better to unseen rooms, and yields RIRs that closely match real acoustics, enabling more realistic AR/VR and speech processing pipelines. Future work may incorporate visual cues and larger-scale datasets to further improve robustness and realism.
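As a point of reference for two of the objective metrics named above, the sketch below computes RT$_{60}$ and DRR from a known impulse response. It is a generic illustration, not the paper's evaluation code: the toy RIR generator, the $-5$ to $-25$ dB fitting range, the 2.5 ms direct-path window, and all function names are assumptions chosen for the example.

```python
import math
import random


def synthetic_rir(fs=16000, rt60=0.5, length_s=1.0, direct_gain=1.0, seed=0):
    """Toy RIR (assumption): a direct impulse plus exponentially decaying noise."""
    rng = random.Random(seed)
    n = int(fs * length_s)
    # Amplitude must fall by a factor of 10**-3 (60 dB in energy) after rt60 seconds.
    decay = 3.0 * math.log(10.0) / rt60
    h = [0.1 * rng.gauss(0.0, 1.0) * math.exp(-decay * i / fs) for i in range(n)]
    h[0] = direct_gain
    return h


def schroeder_edc_db(h):
    """Energy decay curve in dB via Schroeder backward integration."""
    edc = [0.0] * len(h)
    acc = 0.0
    for i in range(len(h) - 1, -1, -1):
        acc += h[i] * h[i]
        edc[i] = acc
    total = edc[0]
    return [10.0 * math.log10(e / total) if e > 0.0 else -math.inf for e in edc]


def estimate_rt60(h, fs, upper_db=-5.0, lower_db=-25.0):
    """Fit a line to the decay curve between upper_db and lower_db, extrapolate to 60 dB."""
    edc = schroeder_edc_db(h)
    pts = [(i / fs, v) for i, v in enumerate(edc) if lower_db <= v <= upper_db]
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    slope = num / den  # decay rate in dB per second (negative)
    return -60.0 / slope


def drr_db(h, fs, window_ms=2.5):
    """Direct-to-reverberant ratio: energy near the main peak versus the remainder."""
    peak = max(range(len(h)), key=lambda i: abs(h[i]))
    half = int(fs * window_ms / 1000.0)
    lo, hi = max(0, peak - half), min(len(h), peak + half + 1)
    direct = sum(x * x for x in h[lo:hi])
    reverb = sum(x * x for x in h) - direct
    return 10.0 * math.log10(direct / reverb)
```

In an evaluation setting, the same two functions would be applied to the estimated and the measured RIR, and the difference between the resulting values reported as the error.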
Abstract
Room Impulse Responses (RIRs) accurately characterize acoustic properties of indoor environments and play a crucial role in applications such as speech enhancement, speech recognition, and audio rendering in augmented reality (AR) and virtual reality (VR). Existing blind estimation methods struggle to achieve practical accuracy. To overcome this challenge, we propose the dynamic audio-room acoustic synthesis (DARAS) model, a novel deep learning framework that is explicitly designed for blind RIR estimation from monaural reverberant speech signals. First, a dedicated deep audio encoder effectively extracts relevant nonlinear latent space features. Second, the Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module, utilizing the efficient Mamba state space model (SSM), accurately estimates key room acoustic parameters and features. Third, the system incorporates a hybrid-path cross-attention feature fusion module, enhancing deep integration between audio and room acoustic features. Finally, our proposed dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experimental results, including a MUSHRA-based subjective listening study, demonstrate that DARAS substantially outperforms existing baseline models, providing a robust and effective solution for practical blind RIR estimation in real-world acoustic environments.
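To make the idea of segmenting early reflections from late reverberation concrete, here is a minimal sketch of splitting an RIR at a boundary time with a short crossfade. It does not reproduce the DAT decoder: the boundary is set here with the common rule of thumb that the mixing time is roughly $\sqrt{V}$ milliseconds for a room volume $V$ in cubic metres, which is an assumption for this example, and the function names and the 1 ms fade length are hypothetical.

```python
import math


def mixing_time_s(volume_m3):
    """Rule-of-thumb boundary (assumption): about sqrt(V) milliseconds, V in cubic metres."""
    return math.sqrt(volume_m3) / 1000.0


def split_early_late(h, fs, boundary_s, fade_ms=1.0):
    """Split an RIR into early and late parts with a linear crossfade around the boundary.

    The two parts use complementary weights, so early[i] + late[i] equals h[i].
    """
    boundary = int(round(boundary_s * fs))
    fade = max(1, int(round(fade_ms * fs / 1000.0)))
    start = boundary - fade // 2  # first sample of the crossfade region
    early, late = [], []
    for i, x in enumerate(h):
        # Weight of the late part: 0 before the fade, ramps linearly to 1 after it.
        w = min(1.0, max(0.0, (i - start) / fade))
        early.append(x * (1.0 - w))
        late.append(x * w)
    return early, late
```

A boundary that depends on an estimated room parameter, rather than a fixed constant such as 50 ms, is what makes the segmentation adaptive: a larger estimated volume pushes the split later, so more of the response is treated as early reflections.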
