Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

TL;DR

This work tackles monaural speech separation when the number of speakers $C$ is unknown. It introduces SepTDA, a time-domain encoder-decoder model that integrates dual-path and triple-path processing with a Transformer decoder-based attractor calculation module to estimate $C$-dependent attractors from a fixed set of $C+1$ speaker queries, followed by FiLM conditioning and inter-speaker refinement. The approach achieves state-of-the-art SI-SDRi on WSJ0-2mix and WSJ0-3/4/5mix benchmarks, demonstrating strong generalization to mixtures with up to $5$ speakers and robust counting under unknown-$C$ scenarios. The proposed combination of LSTM-attention blocks, FiLM conditioning, and attractor-based separation offers a scalable and effective pathway for flexible, high-quality monaural speech separation in real-world settings.

Abstract

We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2 and 3mix respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers.
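The conditioning step described above, where per-speaker attractors modulate a shared mixture embedding and thereby create a speaker dimension, can be illustrated with a minimal NumPy sketch of feature-wise linear modulation (FiLM). This is not the authors' implementation: the function name `film_condition`, the flattened `(D, T)` embedding layout, and the two linear maps producing the scale and shift are all assumptions made for illustration, and the random arrays stand in for learned quantities.

```python
import numpy as np


def film_condition(mixture, attractors, w_gamma, b_gamma, w_beta, b_beta):
    """Condition a mixture embedding on per-speaker attractors via FiLM.

    mixture:    (D, T) mixture embedding (hypothetical flattened layout)
    attractors: (C, D) one attractor vector per speaker
    w_gamma, w_beta: (D, D) weights of two hypothetical linear maps
    b_gamma, b_beta: (D,) biases of those maps
    returns:    (C, D, T) speaker-specific embeddings
    """
    # Each attractor yields a per-feature scale (gamma) and shift (beta).
    gamma = attractors @ w_gamma + b_gamma  # (C, D)
    beta = attractors @ w_beta + b_beta     # (C, D)
    # Broadcasting over time and speakers creates the new speaker dimension.
    return gamma[:, :, None] * mixture[None, :, :] + beta[:, :, None]


# Toy sizes: D features, T frames, C speakers (illustrative values only).
rng = np.random.default_rng(0)
D, T, C = 4, 6, 3
mixture = rng.standard_normal((D, T))
attractors = rng.standard_normal((C, D))
w_gamma = rng.standard_normal((D, D))
b_gamma = np.ones(D)
w_beta = rng.standard_normal((D, D))
b_beta = np.zeros(D)

conditioned = film_condition(mixture, attractors, w_gamma, b_gamma, w_beta, b_beta)
print(conditioned.shape)  # (3, 4, 6)
```

In this sketch the same mixture embedding is reused for every speaker and only the scale and shift differ, which is what lets a later inter-speaker pathway operate along the leading axis of the output.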

Paper Structure (18 sections, 8 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: High-level system illustration of SepTDA.
  • Figure 2: Architectures of the proposed (a) separator; (b) dual-path block; and (c) LSTM-attention block for intra- and inter-chunk processing.
  • Figure 3: Transformer decoder-based attractor calculation module.
  • Figure 4: Triple-path processing block.