DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation

Chunxi Wang, Maoshen Jia, Wenyu Jin

TL;DR

DARAS tackles blind RIR estimation from monaural reverberant speech by jointly extracting audio and room acoustic features, fusing them with a hybrid-path cross‑attention mechanism, and synthesizing RIRs with a decoder that dynamically separates early reflections from late reverberation. The MASS‑BRPE module blindly estimates the room volume $\mathcal{V}$, the reverberation time RT$_{60}$, and the early/late boundary point $\mathcal{B}_{\mathrm{p}}$ using a Mamba state space model and self‑supervised pretraining, while the DAT decoder adapts early reflections and late reverberation to the estimated room parameters. Experiments on seven real datasets plus virtual rooms show state‑of‑the‑art performance in BRPE and blind RIR estimation, with strong objective metrics (the STFT loss $\mathcal{L}_{\mathrm{STFT}}$, RT$_{60}$, and the direct-to-reverberant ratio, DRR) and perceptual gains in MUSHRA tests. The approach reduces the data collection burden, generalizes better to unseen rooms, and yields RIRs that closely match real acoustics, enabling more realistic AR/VR and speech processing pipelines. Future work may incorporate visual cues and larger-scale datasets to further improve robustness and realism.
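For readers unfamiliar with two of the objective metrics named above, the sketch below shows one common way to compute RT$_{60}$ and DRR from a single RIR. It is not taken from the paper: the Schroeder backward integration, the $-5$ to $-25$ dB fitting range, the 2.5 ms direct-path window, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def schroeder_decay_db(rir):
    """Energy decay curve in dB via Schroeder backward integration."""
    energy = np.asarray(rir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]      # energy remaining from each sample onward
    edc = edc / edc[0]                       # normalize so the curve starts at 0 dB
    return 10.0 * np.log10(np.maximum(edc, 1e-12))

def rt60_from_rir(rir, fs, lo_db=-5.0, hi_db=-25.0):
    """RT60 extrapolated from a line fit to the decay curve (assumed fit range)."""
    edc_db = schroeder_decay_db(rir)
    idx = np.where((edc_db <= lo_db) & (edc_db >= hi_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # slope in dB per second
    return -60.0 / slope

def drr_db(rir, fs, window_ms=2.5):
    """Direct-to-reverberant ratio in dB; the window around the peak is an assumption."""
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(window_ms * 1e-3 * fs))
    start, stop = max(peak - half, 0), peak + half + 1
    direct = np.sum(rir[start:stop] ** 2)
    reverb = np.sum(rir[:start] ** 2) + np.sum(rir[stop:] ** 2)
    return 10.0 * np.log10(direct / max(reverb, 1e-12))
```

Applying both functions to an estimated RIR and to the measured one, then comparing the results, gives the kind of RT$_{60}$ and DRR comparison the summary refers to; the exact error definitions used in the paper are not stated here.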

Abstract

Room Impulse Responses (RIRs) accurately characterize acoustic properties of indoor environments and play a crucial role in applications such as speech enhancement, speech recognition, and audio rendering in augmented reality (AR) and virtual reality (VR). Existing blind estimation methods struggle to achieve practical accuracy. To overcome this challenge, we propose the dynamic audio-room acoustic synthesis (DARAS) model, a novel deep learning framework that is explicitly designed for blind RIR estimation from monaural reverberant speech signals. First, a dedicated deep audio encoder effectively extracts relevant nonlinear latent space features. Second, the Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module, utilizing the efficient Mamba state space model (SSM), accurately estimates key room acoustic parameters and features. Third, the system incorporates a hybrid-path cross-attention feature fusion module, enhancing deep integration between audio and room acoustic features. Finally, our proposed dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experimental results, including a MUSHRA-based subjective listening study, demonstrate that DARAS substantially outperforms existing baseline models, providing a robust and effective solution for practical blind RIR estimation in real-world acoustic environments.
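As a rough illustration of what fusing audio and room acoustic features by cross-attention means, the sketch below implements plain single-head scaled dot-product cross-attention with a residual connection. It is not the paper's hybrid-path module, whose internals are not described in this summary. Using audio features as queries and room features as keys and values, the feature shapes, and the projection matrices `w_q`, `w_k`, `w_v` are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(audio_feats, room_feats, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, not the hybrid-path design).

    audio_feats: (T_a, d_audio) queries come from the audio stream (assumption)
    room_feats:  (T_r, d_room)  keys and values come from the room stream (assumption)
    w_q: (d_audio, d_k), w_k: (d_room, d_k), w_v: (d_room, d_audio)
    """
    q = audio_feats @ w_q
    k = room_feats @ w_k
    v = room_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_a, T_r) similarity scores
    weights = softmax(scores, axis=-1)        # each audio frame attends over room tokens
    fused = audio_feats + weights @ v         # residual connection keeps the audio content
    return fused, weights
```

Each row of `weights` sums to one, so every audio frame receives a convex combination of projected room features on top of its original representation.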

Paper Structure

This paper contains 20 sections, 20 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Overview of the DARAS blind RIR estimation model, which estimates an RIR from monaural reverberant speech. The model comprises four modules: (1) a deep audio encoder that extracts nonlinear features from reverberant speech; (2) the MASS-BRPE module, which uses state space models (SSMs) to estimate room acoustic parameters and features; (3) a hybrid-path cross-attention feature fusion module, which dynamically guides the integration of audio features with room acoustic features to produce refined, reverberation-aware representations; and (4) a DAT decoder, which adaptively segments the RIR into early reflections and late reverberation at the boundary point ($\mathcal{B}_\mathrm{p}$) estimated by the MASS-BRPE module and synthesizes each stage separately.
  • Figure 2: The architecture of the deep audio encoder block.
  • Figure 3: Schematic diagram of the overall architecture of the proposed MASS-BRPE module.
  • Figure 4: Hybrid‑path cross‑attention feature fusion module.
  • Figure 5: Schematic diagram of the proposed DAT decoder for dynamically modeling early and late reverberation. The decoder divides the estimated RIR $\hat{\mathbf{h}}(n)$ into an early reverberation component $\hat{\mathbf{h}}_\mathrm{d}(n)$ and a late reverberation component based on the dynamic boundary point $\mathcal{B}_\mathrm{p}$.
  • ...and 4 more figures
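The early/late division described in the Figure 5 caption can be pictured with a minimal sketch that splits an RIR at a boundary index. This only illustrates the segmentation idea, not the DAT decoder itself, which synthesizes each stage with learned components. Treating $\mathcal{B}_\mathrm{p}$ as a time in milliseconds converted to a sample index, and the helper names `ms_to_samples` and `split_rir`, are assumptions.

```python
import numpy as np

def ms_to_samples(boundary_ms, fs):
    """Convert a boundary time in milliseconds to a sample index (assumed unit)."""
    return int(round(boundary_ms * 1e-3 * fs))

def split_rir(rir, boundary_idx):
    """Split an RIR into early and late parts at boundary_idx.

    Both outputs keep the full length of the input, so early + late == rir.
    """
    rir = np.asarray(rir, dtype=float)
    b = int(np.clip(boundary_idx, 0, rir.shape[0]))
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    early[:b] = rir[:b]     # samples before the boundary: early part
    late[b:] = rir[b:]      # samples from the boundary onward: late part
    return early, late
```

Keeping both parts at full length means the two components can be summed directly to recover the complete response, which is convenient when each part is modeled separately.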