RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification

Jacob Bitterman, Daniel Levi, Hilel Hagai Diamandi, Sharon Gannot, Tal Rosenwein

TL;DR

RevRIR addresses room fingerprinting from reverberant speech by learning a joint embedding space for reverberant speech and room impulse responses (RIRs), using a dual-encoder architecture trained with a symmetric contrastive loss. Pre-training aligns the two encoders so that the embeddings are driven by room acoustics rather than by speaker or spoken content; fine-tuning then attaches a linear classifier to the speech encoder for room-shape classification. On simulated data the approach reports high accuracy for both the 110-room and the 3-room-type tasks, and embedding visualizations indicate robustness to speaker and content variation. The framework enables room fingerprinting directly from speech, with potential applications in forensics and virtual/augmented reality; validation on real rooms is left to future work. The embeddings are $d=768$-dimensional, and the loss encourages embeddings from the same room to be similar regardless of spoken content or speaker.

Abstract

This paper focuses on room fingerprinting, a task involving the analysis of an audio recording to determine the specific volume and shape of the room in which it was captured. While it is relatively straightforward to determine the basic room parameters from the Room Impulse Responses (RIR), doing so from a speech signal is a cumbersome task. To address this challenge, we introduce a dual-encoder architecture that facilitates the estimation of room parameters directly from speech utterances. During pre-training, one encoder receives the RIR while the other processes the reverberant speech signal. A contrastive loss function is employed to embed the speech and the acoustic response jointly. In the fine-tuning stage, the specific classification task is trained. In the test phase, only the reverberant utterance is available, and its embedding is used for the task of room shape classification. The proposed scheme is extensively evaluated using simulated acoustic environments.
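
To make the pre-training objective concrete, the following is a minimal sketch of a symmetric contrastive loss over a batch of paired speech and RIR embeddings. The summary above only states that the loss is contrastive and symmetric; the specific form used here (cosine similarity, a temperature of 0.07, and cross-entropy averaged over both directions, in the style of CLIP) is an assumption, and the function names are illustrative rather than taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(speech_emb, rir_emb, temperature=0.07):
    """Assumed CLIP-style loss: row i of each input comes from the same room."""
    s = [l2_normalize(v) for v in speech_emb]
    r = [l2_normalize(v) for v in rir_emb]
    # logits[i][j]: scaled similarity between speech i and RIR j
    logits = [[sum(a * b for a, b in zip(si, rj)) / temperature for rj in r]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the speech-to-RIR and RIR-to-speech directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: 3 rooms, 3-dimensional embeddings (the paper uses d = 768).
speech = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rir = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = symmetric_contrastive_loss(speech, rir)
mismatched = symmetric_contrastive_loss(speech, rir[1:] + rir[:1])
print(aligned, mismatched)
```

In this toy batch the correctly paired embeddings give a much lower loss than the same embeddings with the RIR rows rotated by one, which is the behavior the pre-training stage relies on: matching speech and RIR embeddings from the same room are pulled together, and mismatched pairs are pushed apart.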

Paper Structure

This paper contains 14 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: System overview. (A) Pre-training: two separate encoders, one for the reverberant speech and one for the RIR, are trained jointly using contrastive learning. (B) Fine-tuning: (1) the reverberant-speech encoder is frozen while a classification head is trained for the downstream task (e.g., classification into one of 110 rooms); (2) at inference, a reverberant speech utterance is classified using the frozen encoder and the trained head.
  • Figure 2: (a) Training and (b) Validation losses during the pre-training stage.
  • Figure 3: Embedding space of validation samples, projected to two dimensions using t-SNE. The left, middle, and right plots show the projected embeddings colored by their ground-truth width, depth, and height values, respectively. The embeddings are organized according to the room dimensions, independently of spoken content and speaker.
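
The fine-tuning stage in panel (B) of Figure 1 can be sketched as a linear softmax classifier trained on fixed embeddings. This is only an illustration under stated assumptions: the toy 2-dimensional vectors stand in for the outputs of the frozen speech encoder (the paper's embeddings are 768-dimensional), three classes stand in for the room types, and the plain gradient-descent training loop, learning rate, and function names are hypothetical choices rather than the authors' setup.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    t = sum(e)
    return [x / t for x in e]


def scores(W, b, x):
    """Linear head: one score per class for a single embedding x."""
    return [sum(w * xi for w, xi in zip(Wc, x)) + bc for Wc, bc in zip(W, b)]


def train_linear_head(embeddings, labels, num_classes, lr=0.5, epochs=200):
    """Fit only the head; the embeddings are treated as fixed inputs."""
    dim, n = len(embeddings[0]), len(embeddings)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(num_classes)]
        gb = [0.0] * num_classes
        for x, y in zip(embeddings, labels):
            p = softmax(scores(W, b, x))
            for c in range(num_classes):
                err = (p[c] - (1.0 if c == y else 0.0)) / n  # cross-entropy gradient
                gb[c] += err
                for j in range(dim):
                    gW[c][j] += err * x[j]
        for c in range(num_classes):
            b[c] -= lr * gb[c]
            for j in range(dim):
                W[c][j] -= lr * gW[c][j]
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    s = scores(W, b, x)
    return max(range(len(s)), key=s.__getitem__)


# Toy stand-ins for frozen-encoder outputs: two samples per class.
frozen_embeddings = [[1.0, 0.1], [0.9, -0.1], [0.1, 1.0],
                     [-0.1, 0.9], [-1.0, -0.9], [-0.9, -1.0]]
room_labels = [0, 0, 1, 1, 2, 2]

W, b = train_linear_head(frozen_embeddings, room_labels, num_classes=3)
print([predict(W, b, x) for x in frozen_embeddings])
```

Because only `W` and `b` are updated, the embedding space learned during pre-training is left unchanged, which mirrors the frozen-encoder arrangement described in the caption.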