Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments

Renana Opochinsky, Mordehay Moradi, Sharon Gannot

TL;DR

The paper tackles single-microphone speaker separation under noisy and reverberant conditions, with a focus on real-time robot audition. It introduces Sep-TFAnet, a separation network enhanced with time-frequency (TF) attention that uses the STFT/iSTFT instead of a learned encoder/decoder, along with $\text{Sep-TFAnet}^{\text{VAD}}$, which jointly trains a voice activity detector (VAD) for speaker activity. Through extensive simulations, real-world robot recordings, and an online-mode evaluation, the method demonstrates competitive SI-SDR and substantial WER improvements over baselines, while generalizing to realistic data beyond synthetic corpora. The embedded VAD provides robust activity detection and enables practical downstream processing, such as post-filtering and multi-microphone beamforming, which improves applicability in human-robot interaction scenarios.
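For readers unfamiliar with the separation metric named above, SI-SDR (scale-invariant signal-to-distortion ratio) can be sketched as follows. This is the commonly used definition rather than code from the paper; the function name `si_sdr` and the `eps` stabilizer are assumptions made for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (illustrative sketch)."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the optimally scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    # Everything not explained by the scaled reference counts as distortion.
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score unchanged, which is why the metric suits separation networks whose output level is arbitrary.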

Abstract

Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention, aimed at noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed $\text{Sep-TFAnet}^{\text{VAD}}$, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use the short-time Fourier transform (STFT) and the inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is specifically developed for human-robot interaction and should support online mode. The separation capabilities of $\text{Sep-TFAnet}^{\text{VAD}}$ and Sep-TFAnet were evaluated and extensively analyzed under several acoustic conditions, demonstrating their advantages over competing methods. Since separation networks trained on simulated data tend to perform poorly on real recordings, we also demonstrate the ability of the proposed scheme to better generalize to realistic examples recorded in our acoustic lab by a humanoid robot. Project page: https://Sep-TFAnet.github.io
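The fixed STFT/iSTFT analysis and synthesis mentioned in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration: the frame length, hop size, square-root Hann window, the helper names `stft` and `istft`, and the toy two-tone mixture. In particular, the oracle binary mask merely stands in for a trained network; the abstract does not state how Sep-TFAnet forms its outputs in the TF domain.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Framewise real FFT with a periodic sqrt-Hann window; returns (frames, bins)."""
    win = np.sqrt(np.hanning(n_fft + 1)[:-1])
    x = np.pad(x, (n_fft, n_fft))  # zero-pad so the edges are fully covered
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, n_fft=256, hop=128):
    """Weighted overlap-add synthesis matching `stft`; returns `length` samples."""
    win = np.sqrt(np.hanning(n_fft + 1)[:-1])
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * win
    out = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-8)  # undo the accumulated window weighting
    return out[n_fft:n_fft + length]  # drop the padding added in `stft`

# Toy mixture of two tones standing in for two speakers (hypothetical data).
t = np.arange(4000) / 8000.0
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sin(2 * np.pi * 1800 * t)
mix = s1 + s2

# Analysis and synthesis alone reconstruct the input.
roundtrip = istft(stft(mix), len(mix))

# Oracle binary mask: a placeholder for whatever a trained separator would output.
mask = (np.abs(stft(s1)) > np.abs(stft(s2))).astype(float)
est1 = istft(mask * stft(mix), len(mix))

err_mix = np.mean((mix - s1) ** 2)   # error if the mixture is left unprocessed
err_est = np.mean((est1 - s1) ** 2)  # error after masking in the STFT domain
```

One practical consequence of a fixed transform is that the analysis and synthesis stages carry no trainable parameters and are invertible by construction, so all learning capacity sits in the TF-domain processing between them.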

Paper Structure (18 sections, 5 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: $\text{Sep-TFAnet}^{\text{VAD}}$ architecture. Learnable blocks are depicted in orange, and data blocks are in blue.
  • Figure 2: Separation module
  • Figure 3: 1-D AttConv
  • Figure 4: TF attention layer
  • Figure 6: Recording setup with ARI at the BIU acoustic lab. ARI was positioned at the center of the acoustic lab, with a set of loudspeakers in front of it, arranged on two semi-circles with radii of approximately 1 m and 2 m, respectively. In our experiments, we only used the inner semi-circle, with five loudspeakers positioned at $[-65, -30, 0, 30, 65]^\circ$. The speech signals were played simultaneously from two randomly chosen loudspeakers. The lab's computer automatically controlled the entire scenario.
  • ...and 8 more figures