AV-RIR: Audio-Visual Room Impulse Response Estimation

Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, Purva Chiniya, Dinesh Manocha

TL;DR

The paper tackles the challenge of estimating room impulse responses (RIRs) by leveraging both reverberant speech and visual cues from the environment. It introduces AV-RIR, a multi-modal, multi-task framework built on a neural codec that jointly estimates the RIR and performs speech dereverberation, augmented by Geo-Mat features and an inference-time CRIP retrieval step that improves late reverberation. Empirical results on SoundSpaces and AVSpeech show substantial improvements over audio-only and visual-only baselines in RIR estimation and downstream speech processing tasks, with strong perceptual validation from human listeners. The work enables more realistic AR/VR audio rendering and robust speech processing, while outlining limitations (single-talker, noiseless, stationary scenarios) and future directions for multi-channel and noisy environments.

Abstract

Accurate estimation of the Room Impulse Response (RIR), which captures an environment's acoustic properties, is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based architecture that effectively captures environment geometry and material properties and solves speech dereverberation as an auxiliary task by using multi-task learning. We also propose Geo-Mat features, which augment visual cues with material information, and CRIP, which improves late reverberation components in the estimated RIR by 86% via image-to-RIR retrieval. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches, achieving 36% - 63% improvement across various acoustic metrics in RIR estimation. It also achieves higher preference scores in human evaluation. As an auxiliary benefit, dereverberated speech from AV-RIR shows competitive performance with the state-of-the-art in various spoken language processing tasks and outperforms it on the reverberation time error score on the real-world AVSpeech dataset. Qualitative examples of both synthesized reverberant speech and enhanced speech can be found at https://www.youtube.com/watch?v=tTsKhviukAE.
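
To make the role of the RIR concrete: the Figure 2 caption notes that an estimated RIR is convolved with clean speech to make it sound as if it was uttered in the source environment. The following is a minimal sketch of that convolution step only, not the AV-RIR model. The function name `apply_rir`, the 16 kHz sampling rate, and the synthetic exponentially decaying RIR are assumptions made for illustration.

```python
import numpy as np


def apply_rir(clean_speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a mono clean-speech signal with an RIR to add reverberation."""
    # Full linear convolution: output length is len(clean_speech) + len(rir) - 1.
    return np.convolve(clean_speech, rir, mode="full")


# Toy data (assumed, not from the paper): a synthetic RIR made of
# exponentially decaying noise with a unit direct-path sample.
rng = np.random.default_rng(0)
sample_rate = 16_000  # assumed sampling rate in Hz
t = np.arange(int(0.25 * sample_rate)) / sample_rate
toy_rir = rng.standard_normal(t.size) * np.exp(-t / 0.05)
toy_rir[0] = 1.0

# A unit impulse stands in for "clean speech": convolving it with the RIR
# simply reproduces the RIR, which is an easy sanity check.
impulse = np.zeros(100)
impulse[0] = 1.0
reverberant = apply_rir(impulse, toy_rir)
```

In practice `clean_speech` would be a recorded waveform at the same sampling rate as the RIR; the impulse is used here only because its output is easy to verify.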

Paper Structure (22 sections, 13 equations, 6 figures, 4 tables)

This paper contains 22 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of AV-RIR: Given a source reverberant speech in any environment, AV-RIR estimates the RIR from the reverberant speech using additional visual cues. The estimated RIR can be used to transform any target clean speech as if it is spoken in that environment.
  • Figure 2: Overview of our AV-RIR learning method: Given the input reverberant speech $\mathcal{S}_R$ from any source environment $\mathcal{E}_{S}$, the primary task of AV-RIR is to estimate the room impulse response $\mathcal{RIR}$ by separating it from the clean speech $\mathcal{S}_C$ (see Eq. \ref{eqtn:reverberant_speech}). The input $\mathcal{S}_R$ is first encoded using a Reverberant Speech Encoder $\mathcal{E}_{{R}}$. The latent output $\mathcal{S}_E$ is then passed to two different encoders in two different branches. While one of these branches solves the RIR estimation task, the other solves the speech dereverberation task by estimating $\mathcal{S}_C$. Outputs from both the Speech Dereverberation Encoder $\mathcal{S}_{EC}$ and RIR Encoder $\mathcal{R}_{E}$ are fused with ResNet-18 encodings from the panoramic image $\mathcal{I}_{P}$ and Geo-Mat feature $\mathcal{I}_{G}$ respectively. The output latent multi-modal encodings $\mathcal{I}_{EP}$ and $\mathcal{I}_{EG}$ are then passed to a trainable Residual Vector Quantization module (RVQ), which quantizes $\mathcal{F}_{S}$ to latent codes $\mathcal{Q}_{S}$, and $\mathcal{F}_{R}$ to latent codes $\mathcal{Q}_{R}$. Finally, the HiFi-GAN vocoder decodes the enhanced speech $\mathcal{S}_C$ from $\mathcal{Q}_{S}$ and the RIR decoder decodes estimated early components of RIR $\mathcal{R_{RS}}$ from $\mathcal{Q}_{R}$ which are used to calculate losses for training. At inference time, our CRIP retrieves an RIR from a database $\mathcal{DS}$ and is used to improve late reverberation in the estimated RIR. Finally, post addition, the final estimated RIR is convolved with any $\mathcal{S}_C$ to make it sound like it was uttered in $\mathcal{E}_{S}$.
  • Figure 3: The computation pipeline of the Geo-Mat feature map. The first two channels of the Geo-Mat feature ($\mathcal{I}_{G}$) comprise the absorption coefficients ($\mathcal{AC}$) of each acoustic material. The third channel comprises the depth map. We illustrate objects in the environment having similar $\mathcal{AC}$ with similar colors: chairs and furniture with similar materials are represented in light blue, paintings and wall pictures with similar materials are represented in yellow, and the rest in grey. More details on the method to obtain $\mathcal{AC}$ are given in Section \ref{subsec:Geo-Mat}.
  • Figure 4: Illustration of CRIP training. Like CLIP \cite{clip}, we propose two networks, one to encode a panoramic image and the other to encode the RIR to learn a joint embedding space between both. We use our CRIP-based image-to-RIR retrieval during inference to improve late reverberation in the estimated RIR from AV-RIR.
  • Figure 5: Qualitative Results. (Left) We show the Geo-Mat feature generated using our approach. The cushion chairs with a similar material absorption property are represented in green. The table and window with similar material are represented in red. (Right) We plot the time-domain representation of the RIRs estimated using prior methods and our approach with the ground truth (GT) RIR (GT: Red, Estimated: Blue). We also report the MSE (Eq. \ref{rir_loss}), $T_{60}$ error (RTE), and EDT error (EDT). It can be seen that the RIR estimated using our AV-RIR matches closely with GT RIR when compared with the baseline. Also, we can see that the RIR retrieved from our CRIP has similar late components as the GT RIR. However, the early component of the retrieved RIR (shown in zoom) significantly differs from the GT. Our full AV-RIR pipeline estimates the early components of the RIR using audio-visual features and adds the late component of the RIR from our CRIP to accurately predict the full RIR. The energy decay curve (EDC) depicts the energy remaining in the RIR over time \cite{EDC1}. We can see that the EDC of the late component of RIR estimated from AV-RIR (yellow) matches closely with the GT RIR (purple).
  • ...and 1 more figures
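
The inference-time steps described in the captions for Figures 2, 4, and 5 (retrieve an RIR from a database using an image embedding, combine it with the estimated early component, and inspect the energy decay curve) can be sketched roughly as follows. This is an illustrative approximation under several assumptions, not the authors' implementation: the embeddings are toy 2-D vectors rather than outputs of the CRIP encoders, retrieval uses plain cosine similarity, the early and late parts are joined by a hard splice at a hypothetical `split` index in place of the paper's addition step, and the EDC uses the standard Schroeder backward integration. All function names are hypothetical.

```python
import numpy as np


def retrieve_rir(image_emb, rir_embs, rir_bank):
    """Return the RIR whose embedding is most cosine-similar to image_emb."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    rir_embs = rir_embs / np.linalg.norm(rir_embs, axis=1, keepdims=True)
    best = int(np.argmax(rir_embs @ image_emb))
    return rir_bank[best], best


def splice_early_late(estimated, retrieved, split):
    """Keep estimated[:split] as the early part and retrieved[split:] as the tail."""
    if len(estimated) < split:
        raise ValueError("estimated RIR is shorter than the split index")
    return np.concatenate([estimated[:split], retrieved[split:]])


def energy_decay_curve_db(rir):
    """Schroeder backward integration, normalized to 0 dB at time zero."""
    energy = np.asarray(rir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]  # energy remaining from each sample onward
    edc = edc / edc[0]
    return 10.0 * np.log10(np.maximum(edc, 1e-12))  # floor avoids log of zero


# Toy database (assumed): three exponentially decaying RIRs and 2-D embeddings.
t = np.arange(400)
rir_bank = [np.exp(-t / tau) for tau in (20.0, 60.0, 120.0)]
rir_embs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
image_emb = np.array([0.1, 0.9])  # closest in direction to the second entry

estimated = np.exp(-t / 50.0)  # stand-in for the model's early-component estimate
retrieved, best = retrieve_rir(image_emb, rir_embs, rir_bank)
full_rir = splice_early_late(estimated, retrieved, split=80)
edc_db = energy_decay_curve_db(full_rir)
```

A hard splice can leave a discontinuity at the join; a short cross-fade around `split` is a common alternative, but which scheme the paper uses is not stated in the captions above.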