Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining

Holger Severin Bovbjerg, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan

TL;DR

This work targets robust detection of a target speaker's speech in noisy environments by combining a causal self-supervised pretraining stage, DN-APC, with a speaker-conditioned TS-VAD model. Conformer-based TS-VAD encoders pretrained with DN-APC gain about 2% in average performance in both seen and unseen noise, particularly in speech-vs-noise discrimination. The study systematically compares five speaker-conditioning methods (including FiLM and embedding preprocessing) and finds that FiLM generally gives the best overall results, while multiplication performs best at detecting target speech; embedding preprocessing can provide marginal gains. Representation analysis via t-SNE shows that DN-APC pretraining already captures robust speech vs. non-speech structure before fine-tuning, which underscores SSL’s value for noise-robust TS-VAD systems and suggests potential gains in real-world diarization and ASR pipelines.

Abstract

Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target speaker in an audio frame. Recently, deep neural network-based models have shown good performance in this task. However, training these models requires extensive labelled data, which is costly and time-consuming to obtain, particularly if generalization to unseen environments is crucial. To mitigate this, we propose a causal Self-Supervised Learning (SSL) pretraining framework, called Denoising Autoregressive Predictive Coding (DN-APC), to enhance TS-VAD performance in noisy conditions. We also explore various speaker conditioning methods and evaluate their performance under different noisy conditions. Our experiments show that DN-APC improves performance in noisy conditions, with a general improvement of approximately 2% in both seen and unseen noise. Additionally, we find that FiLM conditioning provides the best overall performance. Representation analysis via t-SNE plots reveals robust initial representations of speech and non-speech from pretraining. This underscores the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.
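
The denoising objective named in the abstract can be illustrated with a small sketch. It assumes, as in standard APC, an L1 loss between the model's output at frame t and the feature frame a fixed number of steps ahead, with the difference that the input is noisy while the target is clean. The function name `dn_apc_loss`, the shift of 3 frames, the feature dimensions, and the identity "encoder" are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def dn_apc_loss(predicted, clean_feats, shift):
    """Mean L1 distance between the prediction made at frame t and the
    clean feature frame at t + shift (hypothetical helper)."""
    if not 0 < shift < len(clean_feats):
        raise ValueError("shift must be in [1, T - 1]")
    pred = predicted[:-shift]     # predictions made at frames 0 .. T-shift-1
    target = clean_feats[shift:]  # clean frames shift .. T-1
    return float(np.abs(pred - target).mean())

rng = np.random.default_rng(0)
T, D = 100, 40                       # frames and feature dimension (assumed)
clean = rng.standard_normal((T, D))  # stand-in for clean speech features
# Stand-in for a noisy version of the same utterance. Real pipelines would
# mix noise into the waveform before feature extraction.
noisy = clean + 0.5 * rng.standard_normal((T, D))

# Placeholder "encoder": identity on the noisy input. A real model would be a
# causal network that sees only frames up to t when predicting frame t + shift.
predicted = noisy
loss = dn_apc_loss(predicted, clean, shift=3)
```

Because the target is the clean signal, lowering this loss requires the encoder both to predict ahead and to suppress the noise. This is one plausible reading of why the pretrained representations separate speech from non-speech before any labels are used.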

Paper Structure (28 sections, 17 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Illustration of the DN-APC framework also used in bovbjerg_ssl_pvad.
  • Figure 2: Histograms showing the distributions of the concatenated LibriSpeech training and test set utterances.
  • Figure 3: Various TS-VAD model setups for speaker conditioning. Blue boxes designate the pretrained encoder, and green boxes represent elements related to speaker conditioning.
  • Figure 4: t-SNE plots of hidden representations of speech in mixed noise at various dB levels and in clean speech. The first column shows the supervised baseline, the second column shows the DN-APC encoder after pretraining, and the third column shows the DN-APC model after fine-tuning for TS-VAD using FiLM conditioning.
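
As a rough illustration of two of the speaker-conditioning setups referred to in Figure 3, the sketch below contrasts FiLM (a per-feature scale and shift derived from the speaker embedding) with plain multiplicative conditioning (scale only). All dimensions, weight matrices, and function names are hypothetical, and the random weights stand in for learned projections. The paper's actual layer placement and embedding preprocessing are not reproduced here.

```python
import numpy as np

def film_condition(h, e, W_gamma, b_gamma, W_beta, b_beta):
    """FiLM: scale and shift each feature of h using the speaker embedding e."""
    gamma = e @ W_gamma + b_gamma  # (D,) per-feature scale
    beta = e @ W_beta + b_beta     # (D,) per-feature shift
    return gamma * h + beta        # broadcast over the T frames of h

def multiply_condition(h, e, W, b):
    """Multiplicative conditioning: per-feature scale only, no shift."""
    scale = e @ W + b              # (D,)
    return scale * h

rng = np.random.default_rng(0)
T, D, E = 50, 64, 256             # frames, encoder dim, embedding dim (assumed)
h = rng.standard_normal((T, D))   # stand-in for encoder outputs
e = rng.standard_normal(E)        # stand-in for a target-speaker embedding

# Random stand-ins for learned projection weights.
W_g, b_g = 0.01 * rng.standard_normal((E, D)), np.ones(D)
W_b, b_b = 0.01 * rng.standard_normal((E, D)), np.zeros(D)

film_out = film_condition(h, e, W_g, b_g, W_b, b_b)
mult_out = multiply_condition(h, e, W_g, b_g)
```

In this formulation multiplication is the special case of FiLM in which the shift term is zero, so FiLM can represent everything multiplication can, plus a speaker-dependent offset.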