Single-Channel Target Speech Extraction Utilizing Distance and Room Clues
Runwu Shi, Zirui Lin, Benjamin Yen, Jiang Wang, Ragib Amin Nihal, Kazuhiro Nakadai
TL;DR
This work tackles single-channel target speech extraction (TSE) using distance clues augmented with room environmental information. The authors design a time-frequency (TF) domain model with learnable distance and room embeddings, built from a query embedding generator and query/basic blocks, that extracts the speech of the speaker whose distance matches the query distance. Experiments on simulated and real room impulse responses show that incorporating room clues improves generalization and, after finetuning on real data, substantially improves performance. The results demonstrate the feasibility and practical potential of distance-based TSE with room-aware embeddings in settings where enrolled speaker clues are unavailable.
Abstract
This paper aims to achieve single-channel target speech extraction (TSE) in enclosures using distance clues and room information. Recent work has verified the feasibility of distance clues for the TSE task: distance implies the sound source's direct-to-reverberation ratio (DRR) and can therefore be exploited by speech separation and TSE systems. However, the distance clue is strongly influenced by the room's acoustic characteristics, such as its dimensions and reverberation time, which makes it difficult for TSE systems that rely solely on distance clues to generalize across different rooms. To address this, we propose providing room environmental information (room dimensions and reverberation time) to distance-based TSE to improve generalization. Specifically, we propose a distance- and environment-based TSE model in the time-frequency (TF) domain with learnable distance and room embeddings. Results on both simulated and real recorded datasets demonstrate its feasibility. Demonstration materials are available at https://runwushi.github.io/distance-room-demo-page/.
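To illustrate why a distance clue alone is ambiguous across rooms, the following minimal sketch estimates DRR from distance, room volume, and reverberation time. It is not the model proposed in this paper. It assumes the textbook diffuse-field approximation, in which the critical distance is roughly 0.057 * sqrt(Q * V / RT60) and DRR falls by 6 dB per doubling of distance. The function names and the two example rooms are hypothetical.

```python
import math


def critical_distance(volume_m3, rt60_s, directivity=1.0):
    """Distance (m) at which direct and reverberant energy are equal.

    Diffuse-field textbook approximation (an assumption of this sketch):
    d_c ~= 0.057 * sqrt(Q * V / RT60), with V in m^3 and RT60 in seconds.
    """
    return 0.057 * math.sqrt(directivity * volume_m3 / rt60_s)


def drr_db(distance_m, volume_m3, rt60_s, directivity=1.0):
    """Approximate DRR in dB for a source at distance_m.

    Direct energy decays as 1/d^2 while the reverberant energy is treated
    as constant, so DRR = 20 * log10(d_c / d) and equals 0 dB at d = d_c.
    """
    d_c = critical_distance(volume_m3, rt60_s, directivity)
    return 20.0 * math.log10(d_c / distance_m)


# Same 2 m source distance in two hypothetical rooms.
small_room = drr_db(2.0, volume_m3=4 * 5 * 3, rt60_s=0.3)
large_room = drr_db(2.0, volume_m3=10 * 12 * 4, rt60_s=0.9)
print(f"small room DRR: {small_room:.2f} dB")
print(f"large room DRR: {large_room:.2f} dB")
```

Under this approximation the same source distance maps to different DRR values in the two rooms, which is the ambiguity that supplying room dimensions and reverberation time is meant to resolve.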
