Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments

Renana Opochinsky; Mordehay Moradi; Sharon Gannot

TL;DR

The paper tackles single-microphone speaker separation under noisy and reverberant conditions, with a focus on real-time robot audition. It introduces Sep-TFAnet, a separation network enhanced with time-frequency (TF) attention that uses the STFT and iSTFT instead of a learned encoder and decoder, along with $ \text{Sep-TFAnet}^{\text{VAD}}$, which jointly trains a voice activity detector (VAD) to estimate each speaker's activity. In extensive simulations, real-world robot recordings, and an online-mode evaluation, the method achieves competitive scale-invariant signal-to-distortion ratio (SI-SDR) and substantial word error rate (WER) improvements over baselines, while generalizing beyond synthetic corpora to realistic data. The embedded VAD provides robust activity detection and enables practical downstream processing, such as post-filtering and multi-microphone beamforming, which improves real-world applicability in human-robot interaction scenarios.
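For readers unfamiliar with the SI-SDR metric mentioned above, the following is a minimal sketch of its standard definition in NumPy. The function name `si_sdr`, the `eps` stabilizer, and the toy signals are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimally scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything in the estimate that is not the scaled target counts as distortion.
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)

# Toy check (assumed signals): a 100 Hz sine plus an orthogonal cosine at 1/10 amplitude.
t = np.arange(8000) / 8000.0
reference = np.sin(2 * np.pi * 100 * t)
estimate = reference + 0.1 * np.cos(2 * np.pi * 100 * t)

score = si_sdr(estimate, reference)              # about 20 dB: distortion energy is 1/100 of the target
score_rescaled = si_sdr(3.0 * estimate, reference)  # unchanged by rescaling the estimate
```

Because the target is obtained by projection, multiplying the estimate by any nonzero gain leaves the score unchanged, which is why the metric suits separation systems whose output level is arbitrary.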

Abstract

Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with TF attention, aimed at noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed $ \text{Sep-TFAnet}^{\text{VAD}}$, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use the short-time Fourier transform (STFT) and the inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is specifically developed for human-robot interaction and should support online mode. The separation capabilities of $ \text{Sep-TFAnet}^{\text{VAD}}$ and Sep-TFAnet were evaluated and extensively analyzed under several acoustic conditions, demonstrating their advantages over competing methods. Since separation networks trained on simulated data tend to perform poorly on real recordings, we also demonstrate the ability of the proposed scheme to better generalize to realistic examples recorded in our acoustic lab by a humanoid robot. Project page: https://Sep-TFAnet.github.io
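The idea of using a fixed STFT/iSTFT pair for analysis and synthesis, rather than a learned encoder and decoder, can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the helper names `stft` and `istft`, the frame length of 256 with a hop of 128, the Hann window, and the hand-made binary frequency mask that stands in for whatever a separation network would estimate. None of it reflects the paper's actual configuration.

```python
import numpy as np

N_FFT, HOP = 256, 128  # assumed frame length and hop size (not from the paper)

def _window(n_fft):
    # Periodic Hann window, used for both analysis and synthesis.
    return np.hanning(n_fft + 1)[:-1]

def stft(x, n_fft=N_FFT, hop=HOP):
    """Fixed analysis transform: returns an array of shape (frames, n_fft // 2 + 1)."""
    w = _window(n_fft)
    x_pad = np.pad(x, (n_fft, n_fft))  # zero-pad so the signal edges are fully covered
    n_frames = 1 + (len(x_pad) - n_fft) // hop
    frames = np.stack([x_pad[i * hop:i * hop + n_fft] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, n_fft=N_FFT, hop=HOP):
    """Fixed synthesis transform: weighted overlap-add, trimmed to `length` samples."""
    w = _window(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros((frames.shape[0] - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame * w
        norm[i * hop:i * hop + n_fft] += w ** 2
    out /= np.maximum(norm, 1e-8)  # undo the combined analysis/synthesis windowing
    return out[n_fft:n_fft + length]

# Toy mixture (assumed): two tones stand in for two speakers.
fs = 16000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 500 * t)
high = np.sin(2 * np.pi * 3000 * t)
mixture = low + high

spec = stft(mixture)
# Placeholder mask: keep bins below 1750 Hz. A trained network would estimate this instead.
freqs = np.fft.rfftfreq(N_FFT, d=1.0 / fs)
mask_low = (freqs < 1750).astype(float)

low_est = istft(spec * mask_low, len(mixture))
identity = istft(spec, len(mixture))  # no masking: should reproduce the mixture
```

With the mask set to all ones, the analysis-synthesis chain reproduces the input up to floating-point error, so any distortion in a separated output comes from the mask itself rather than from the transform.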

Paper Structure (18 sections, 5 equations, 13 figures, 3 tables)

Introduction
Problem Formulation
Proposed Model
Separation Module
Online Mode
VAD network
Objective Functions
Experimental Study
Datasets
Robot Interaction: Experimental Setup
Training Procedure
Baseline Methods
Experimental Results
Separation with Simulated Data
Separation with Data Recorded on ARI Robot
...and 3 more sections

Figures (13)

Figure 1: $\text{Sep-TFAnet}^{\text{VAD}}$ architecture. Learnable blocks are depicted in orange, and data blocks are in blue.
Figure 2: Separation module
Figure 3: 1-D AttConv
Figure 4: TF attention layer
Figure 6: Recording setup with ARI at the BIU acoustic lab. ARI was positioned at the center of the acoustic lab, with a set of loudspeakers in front of it, arranged on two semi-circles with radii of approximately 1 m and 2 m. In our experiments, we only used the inner semi-circle, with five loudspeakers positioned at $[-65, -30, 0, 30, 65]^\circ$. The speech signals were played simultaneously from two randomly chosen loudspeakers. The lab's computer automatically controlled the entire scenario.
...and 8 more figures
