Single-Channel Target Speech Extraction Utilizing Distance and Room Clues

Runwu Shi, Zirui Lin, Benjamin Yen, Jiang Wang, Ragib Amin Nihal, Kazuhiro Nakadai

TL;DR

This work tackles single-channel target speech extraction by leveraging distance cues augmented with room environmental information. The authors design a TF-domain model with learnable distance and room embeddings, including a query embedding generator and query/basic blocks, to extract target speech when the query distance matches speaker locations. Experiments on simulated and real room impulse responses show that incorporating room clues improves generalization and, with finetuning on real data, substantially boosts performance. The results demonstrate the feasibility and practical potential of distance-based TSE with room-aware embeddings for environments where enrolled speaker clues are unavailable.

Abstract

This paper addresses single-channel target speech extraction (TSE) in enclosures using distance clues and room information. Recent works have verified the feasibility of distance clues for the TSE task: the source distance implies the direct-to-reverberation ratio (DRR) of the sound source, which can be exploited by speech separation and TSE systems. However, distance clues are strongly influenced by the room's acoustic characteristics, such as its dimensions and reverberation time, which makes it difficult for TSE systems that rely solely on distance clues to generalize across different rooms. To address this, we propose providing room environmental information (room dimensions and reverberation time) to distance-based TSE for better generalization. Specifically, we propose a distance- and environment-based TSE model in the time-frequency (TF) domain with learnable distance and room embeddings. Results on both simulated and real recorded datasets demonstrate its feasibility. Demonstration materials are available at https://runwushi.github.io/distance-room-demo-page/.
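To make the link between distance, DRR, and room acoustics concrete, the following is a minimal sketch and not the authors' method. It builds a toy room impulse response (a 1/d direct-path impulse plus an exponentially decaying tail whose level does not depend on distance) and estimates the DRR from it. The function names `synthetic_rir` and `drr_db`, the 2.5 ms direct-path window, and the `reverb_gain` value are all assumptions chosen for illustration.

```python
import math


def synthetic_rir(distance_m, t60_s, fs=16000, length_s=0.5, c=343.0, reverb_gain=0.05):
    """Toy RIR (assumption): 1/d direct impulse plus an exponentially decaying tail."""
    n = int(length_s * fs)
    rir = [0.0] * n
    direct_idx = int(round(distance_m / c * fs))  # propagation delay in samples
    # Amplitude decay rate chosen so the tail level drops by 60 dB after t60_s.
    decay = 3.0 * math.log(10.0) / t60_s
    for i in range(direct_idx + 1, n):
        t = (i - direct_idx) / fs
        sign = 1.0 if i % 2 == 0 else -1.0  # deterministic stand-in for a noise-like tail
        rir[i] = sign * reverb_gain * math.exp(-decay * t)
    rir[direct_idx] = 1.0 / distance_m  # spherical spreading of the direct path
    return rir


def drr_db(rir, fs=16000, direct_window_ms=2.5):
    """Energy in a short window around the main peak over the energy after it, in dB."""
    peak = max(range(len(rir)), key=lambda i: abs(rir[i]))
    half = int(direct_window_ms * 1e-3 * fs)
    lo, hi = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = sum(x * x for x in rir[lo:hi])
    reverb = sum(x * x for x in rir[hi:])
    return 10.0 * math.log10(direct / reverb)


if __name__ == "__main__":
    for t60 in (0.3, 0.8):
        for d in (1.0, 2.0, 3.0):
            print(f"T60={t60:.1f} s, d={d:.1f} m -> DRR={drr_db(synthetic_rir(d, t60)):.1f} dB")
```

In this toy model the DRR falls as the source moves away, but the same distance yields a different DRR when the reverberation time changes. This is the ambiguity that motivates supplying room information alongside the distance clue.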

Paper Structure

This paper contains 20 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Target speech extraction using distance and room clues.
  • Figure 2: Flowchart of the proposed method showing (a) Overall structure, (b) Structure of Query block, (c) Structure of Query embedding generator.
  • Figure 3: RIR recording locations and environment.