RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification
Jacob Bitterman, Daniel Levi, Hilel Hagai Diamandi, Sharon Gannot, Tal Rosenwein
TL;DR
RevRIR addresses room fingerprinting from reverberant speech by learning a joint embedding space for reverberant speech and room impulse responses (RIRs), using a dual-encoder architecture trained with a symmetric contrastive loss. Pre-training aligns the two embeddings so that they capture room acoustics rather than speaker or content, while fine-tuning attaches a linear classifier to one encoder for room-shape classification. The approach yields strong results on simulated data, with high accuracy on both the 110-room and 3-room-type tasks, and embedding visualization indicates robustness to speaker and content variation. The framework enables room fingerprinting directly from speech, with potential applications in forensics and virtual/augmented reality; future work focuses on validation in real rooms. The joint embedding space has dimension $d=768$, and the loss encourages embeddings from the same room to be similar regardless of spoken content or speaker.
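The symmetric contrastive objective mentioned above can be illustrated with a minimal NumPy sketch. It assumes a CLIP-style formulation in which the i-th speech embedding and the i-th RIR embedding in a batch come from the same room and all other pairings act as negatives. The function names, the temperature value of 0.07, and the random stand-in embeddings are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(speech_emb, rir_emb, temperature=0.07):
    # temperature=0.07 is an assumed value, not one reported in the source.
    s = l2_normalize(speech_emb)
    r = l2_normalize(rir_emb)
    logits = s @ r.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matching pairs sit on the diagonal
    loss_s2r = -log_softmax(logits, axis=1)[idx, idx].mean()  # speech -> RIR
    loss_r2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # RIR -> speech
    return 0.5 * (loss_s2r + loss_r2s)

# Random stand-ins for encoder outputs (hypothetical data, d = 768).
rng = np.random.default_rng(0)
batch, d = 8, 768
rir = rng.standard_normal((batch, d))
speech_aligned = rir + 0.01 * rng.standard_normal((batch, d))
speech_random = rng.standard_normal((batch, d))

loss_aligned = symmetric_contrastive_loss(speech_aligned, rir)
loss_random = symmetric_contrastive_loss(speech_random, rir)
print(loss_aligned, loss_random)
```

In this toy setting, the loss is lower when the speech embeddings lie close to their paired RIR embeddings than when the two sets are unrelated, which is the behavior the pre-training stage is meant to encourage.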
Abstract
This paper focuses on room fingerprinting, a task involving the analysis of an audio recording to determine the specific volume and shape of the room in which it was captured. While it is relatively straightforward to determine the basic room parameters from the room impulse response (RIR), doing so from a speech signal is a cumbersome task. To address this challenge, we introduce a dual-encoder architecture that facilitates the estimation of room parameters directly from speech utterances. During pre-training, one encoder receives the RIR while the other processes the reverberant speech signal. A contrastive loss function is employed to embed the speech and the acoustic response jointly. In the fine-tuning stage, the model is trained on the specific classification task. In the test phase, only the reverberant utterance is available, and its embedding is used for room shape classification. The proposed scheme is extensively evaluated using simulated acoustic environments.
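The fine-tuning and test stages described above can be sketched as a linear head trained on top of speech embeddings. This is a hedged illustration only: it treats the embeddings as fixed inputs (the paper may also update the encoder), it uses synthetic clustered vectors in place of real encoder outputs, and the helper names, learning rate, and step count are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_head(emb, labels, num_classes, lr=0.1, steps=100):
    # Softmax regression trained with plain gradient descent on cross-entropy.
    n, d = emb.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        probs = softmax(emb @ W + b)
        grad = (probs - onehot) / n         # gradient w.r.t. the logits
        W -= lr * emb.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(emb, W, b):
    # At test time only the speech embedding is needed.
    return (emb @ W + b).argmax(axis=1)

# Synthetic stand-ins for speech embeddings of 3 room-shape classes (d = 768).
rng = np.random.default_rng(0)
d, num_classes = 768, 3
centers = rng.standard_normal((num_classes, d))

def fake_embeddings(n_per_class):
    labels = np.repeat(np.arange(num_classes), n_per_class)
    emb = centers[labels] + 0.5 * rng.standard_normal((labels.size, d))
    return emb, labels

train_emb, train_labels = fake_embeddings(30)
test_emb, test_labels = fake_embeddings(10)
W, b = train_linear_head(train_emb, train_labels, num_classes)
accuracy = (predict(test_emb, W, b) == test_labels).mean()
print(accuracy)
```

The synthetic clusters are well separated by construction, so the accuracy printed here says nothing about the performance of the actual system.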
