VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno
TL;DR
This work introduces VoiceFilter, a speaker-conditioned spectrogram masking approach for targeted voice separation. By pairing a dedicated speaker encoder (GE2E-trained d-vectors) with a mask-based VoiceFilter network that incorporates the target speaker embedding, the system isolates the target voice from multi-speaker mixtures without requiring prior knowledge of the number of speakers. Evaluations on LibriSpeech and VCTK show substantial reductions in WER for noisy, multi-speaker inputs, with minimal degradation on clean speech, and SDR gains that surpass a permutation-invariant baseline. The approach demonstrates strong cross-dataset generalization and points to promising future directions, including larger-scale training, more speakers, and joint optimization with ASR or speech enhancement.
Abstract
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; and (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
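
To make the masking step concrete, the sketch below shows one way a speaker embedding can condition a spectrogram mask. It is a minimal illustration, not the architecture used in this work. It assumes the embedding is repeated at every time frame and concatenated with the noisy magnitude spectrogram, and it uses a single randomly initialized linear layer followed by a sigmoid as a stand-in for the trained masking network. The function name `apply_speaker_conditioned_mask`, the array shapes, and the random inputs are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def apply_speaker_conditioned_mask(noisy_mag, d_vector, weights, bias):
    """Predict a soft mask from (spectrogram, embedding) and apply it.

    noisy_mag: (T, F) magnitude spectrogram of the mixture
    d_vector:  (D,)   speaker embedding of the target speaker
    weights:   (F + D, F) placeholder for a trained masking network
    bias:      (F,)
    """
    num_frames = noisy_mag.shape[0]
    # Assumption: condition every frame on the same embedding.
    tiled = np.tile(d_vector, (num_frames, 1))             # (T, D)
    features = np.concatenate([noisy_mag, tiled], axis=1)  # (T, F + D)
    # Sigmoid keeps every mask value strictly between 0 and 1.
    mask = sigmoid(features @ weights + bias)              # (T, F)
    # Element-wise masking of the noisy magnitudes.
    return mask * noisy_mag, mask


# Random stand-ins for real data and trained parameters (illustrative only).
rng = np.random.default_rng(0)
T, F, D = 50, 257, 256
noisy_mag = np.abs(rng.standard_normal((T, F)))
d_vector = rng.standard_normal(D)
d_vector /= np.linalg.norm(d_vector)
weights = 0.01 * rng.standard_normal((F + D, F))
bias = np.zeros(F)

enhanced_mag, mask = apply_speaker_conditioned_mask(
    noisy_mag, d_vector, weights, bias
)
```

Because the mask depends on the embedding rather than on a fixed set of output channels, a design of this kind does not need to know how many speakers are in the mixture: a different reference embedding yields a different mask over the same input.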
