VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno

TL;DR

This work introduces VoiceFilter, a speaker-conditioned spectrogram masking approach for targeted voice separation. By pairing a dedicated speaker encoder (GE2E-trained d-vectors) with a mask-based VoiceFilter network that incorporates the target speaker embedding, the system isolates the target voice from multi-speaker mixtures without requiring prior knowledge of the number of speakers. Evaluations on LibriSpeech and VCTK show substantial reductions in WER for noisy, multi-speaker inputs, with minimal degradation on clean speech, and SDR gains that surpass a permutation-invariant baseline. The approach demonstrates strong cross-dataset generalization and points to promising future directions, including larger-scale training, more speakers, and joint optimization with ASR or speech enhancement.

Abstract

In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
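
The masking step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' network: the single linear layer with a sigmoid is a stand-in for the spectrogram masking network, the random vector stands in for a real speaker embedding, and the names `apply_speaker_conditioned_mask`, `weights`, and `bias` are hypothetical. It only shows the data flow: the speaker embedding is repeated for every frame, concatenated with the noisy spectrogram, mapped to a mask with values between 0 and 1, and the mask is multiplied element-wise with the noisy spectrogram.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, used to keep mask values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def apply_speaker_conditioned_mask(noisy_mag, d_vector, weights, bias):
    """Toy stand-in for a speaker-conditioned masking network.

    noisy_mag: (T, F) magnitude spectrogram of the mixture
    d_vector:  (D,) embedding of the target speaker
    weights:   (F + D, F) hypothetical linear layer (assumption, not from the paper)
    bias:      (F,) hypothetical bias
    """
    num_frames = noisy_mag.shape[0]
    # Repeat the speaker embedding so every frame sees the same conditioning vector.
    tiled = np.tile(d_vector, (num_frames, 1))              # (T, D)
    features = np.concatenate([noisy_mag, tiled], axis=1)   # (T, F + D)
    # Predict a soft mask and apply it element-wise to the noisy spectrogram.
    mask = sigmoid(features @ weights + bias)               # (T, F)
    return mask * noisy_mag, mask


# Synthetic inputs only; no real audio or trained model is involved.
rng = np.random.default_rng(0)
T, F, D = 5, 8, 4
noisy_mag = np.abs(rng.normal(size=(T, F)))
d_vector = rng.normal(size=D)
d_vector /= np.linalg.norm(d_vector)  # unit-length embedding (assumption)
weights = rng.normal(scale=0.1, size=(F + D, F))
bias = np.zeros(F)

enhanced, mask = apply_speaker_conditioned_mask(noisy_mag, d_vector, weights, bias)
```

Because the mask only scales each time-frequency bin by a value between 0 and 1, the output can attenuate interfering speech but never amplifies the mixture. In a trained system the mask would depend on the embedding in a learned way, so changing the reference speaker would change which voice is kept.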

Paper Structure

This paper contains 17 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: System architecture.
  • Figure 2: Input data processing workflow.