Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor
Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe
TL;DR
This work tackles monaural speech separation when the number of speakers $C$ is unknown. It introduces SepTDA, a time-domain encoder-decoder model that combines dual-path and triple-path processing with a Transformer decoder-based attractor (TDA) calculation module. TDA estimates one attractor per speaker from a fixed, small set of learned speaker queries; the attractors then condition the mixture embedding through feature-wise linear modulation (FiLM), and triple-path blocks refine the result across speakers. The approach achieves state-of-the-art SI-SDR improvement (SI-SDRi) on the WSJ0-2mix and WSJ0-3/4/5mix benchmarks, generalizing to mixtures with up to 5 speakers and counting sources reliably when $C$ is unknown. The combination of LSTM-attention blocks, FiLM conditioning, and attractor-based separation offers a scalable route to flexible, high-quality monaural speech separation.
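The SI-SDRi figures quoted here measure how much the scale-invariant signal-to-distortion ratio of a separated signal improves over that of the unprocessed mixture. The following is a minimal NumPy sketch of the standard SI-SDR definition, not the authors' evaluation code; the function names `si_sdr` and `si_sdr_improvement` and the `eps` stabilizer are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (hypothetical helper)."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimally scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything not explained by the scaled target counts as distortion.
    distortion = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, target, mixture):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)       # synthetic stand-in for one speaker
    interferer = rng.standard_normal(16000)   # synthetic stand-in for another speaker
    mixture = target + interferer
    estimate = target + 0.1 * interferer      # an imperfect separation result
    print(f"SI-SDRi: {si_sdr_improvement(estimate, target, mixture):.1f} dB")
```

For multi-speaker mixtures, the score is usually averaged over speakers after choosing the best assignment of estimates to references; that permutation step is omitted from this sketch.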
Abstract
We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) dual-path processing blocks that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations among these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding through feature-wise linear modulation (FiLM) conditioning, which creates a speaker dimension. The mixture embedding, now conditioned on the speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best results reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2mix and WSJ0-3mix, respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers.
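To illustrate how FiLM conditioning can create a speaker dimension, the sketch below applies a per-speaker scale and shift, derived from each attractor, to a shared mixture embedding. It is a simplified NumPy illustration under stated assumptions, not the authors' implementation: the tensor shapes, the linear projections `w_scale` and `w_shift`, and the function name `film_condition` are all hypothetical.

```python
import numpy as np


def film_condition(mixture_emb, attractors, w_scale, w_shift):
    """Broadcast a shared embedding into one FiLM-modulated copy per speaker.

    mixture_emb: (T, D) mixture embedding (assumed shape: frames x features)
    attractors:  (C, D) one attractor vector per speaker
    w_scale, w_shift: (D, D) hypothetical linear maps producing FiLM parameters
    returns:     (C, T, D) speaker-conditioned embeddings
    """
    gamma = attractors @ w_scale  # (C, D) feature-wise scales
    beta = attractors @ w_shift   # (C, D) feature-wise shifts
    # Broadcasting over the time axis adds the new leading speaker dimension.
    return gamma[:, None, :] * mixture_emb[None, :, :] + beta[:, None, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_frames, feat_dim, num_speakers = 50, 16, 3
    mixture_emb = rng.standard_normal((num_frames, feat_dim))
    attractors = rng.standard_normal((num_speakers, feat_dim))
    w_scale = rng.standard_normal((feat_dim, feat_dim))
    w_shift = rng.standard_normal((feat_dim, feat_dim))
    conditioned = film_condition(mixture_emb, attractors, w_scale, w_shift)
    print(conditioned.shape)  # one (T, D) slice per speaker
```

Because the output size follows the number of attractors, the same function handles any speaker count, which is what lets later blocks process a variable number of speakers along that axis.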
