SPGM: Prioritizing Local Features for enhanced speech separation performance
Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma
TL;DR
This work targets efficient speech separation by prioritizing local feature modeling. The authors replace the inter-blocks of a dual-path model with a Single-Path Global Modulation (SPGM) block, which consists of a parameter-free global pooling module and a lightweight modulation head. The result is a single-path architecture in which every transformer layer performs local intra-chunk processing. SPGM reaches 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, matching recent state-of-the-art models with up to 8 times fewer parameters, while the modulation module accounts for only 2% of the model's total parameters. The approach offers a compact foundation for future work on efficient speech separation that retains global context while concentrating model capacity on local modeling.
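The results above are reported as SI-SDRi, the improvement in scale-invariant signal-to-distortion ratio of the separated estimate over the unprocessed mixture. The following is a minimal sketch of the standard SI-SDR definition in plain Python. It is not taken from the authors' code; the function names and the `eps` stabilizer are illustrative assumptions.

```python
import math

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length signals.

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    # Remove the mean so the measure ignores any DC offset.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    # Everything left over after the projection counts as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate by any nonzero constant leaves the score essentially unchanged, which is what makes the metric scale-invariant.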
Abstract
Dual-path is a popular architecture for speech separation models (e.g., Sepformer) that splits long sequences into overlapping chunks, which are processed by intra-blocks that model intra-chunk local features and inter-blocks that model inter-chunk global relationships. However, it has been found that inter-blocks, which account for half of a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure, which consists of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively and matching the performance of recent SOTA models with up to 8 times fewer parameters. Model and weights are available at huggingface.co/yipjiaqi/spgm
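To make the data flow concrete, the sketch below shows overlapping chunking followed by a parameter-free pooling step and a small modulation step. It illustrates the general idea only and is not the authors' implementation: the abstract does not specify the pooling axis or the form of the modulation, so the mean over all frames and the feature-wise scale-and-shift used here are assumptions, as are all function names and the `scale_weights` and `shift_weights` parameters.

```python
import math

def split_into_chunks(frames, chunk_size, hop):
    """Split a list of feature frames into overlapping chunks.

    The last chunk is zero-padded so every chunk has `chunk_size` frames.
    """
    dim = len(frames[0])
    chunks = []
    start = 0
    while True:
        chunk = [list(f) for f in frames[start:start + chunk_size]]
        chunk += [[0.0] * dim for _ in range(chunk_size - len(chunk))]
        chunks.append(chunk)
        if start + chunk_size >= len(frames):
            break
        start += hop
    return chunks

def global_pool(chunks):
    """Parameter-free pooling (assumed form): mean over every frame of every chunk.

    Padded frames are included in the mean for simplicity.
    """
    dim = len(chunks[0][0])
    count = sum(len(c) for c in chunks)
    return [sum(f[d] for c in chunks for f in c) / count for d in range(dim)]

def modulate(chunks, context, scale_weights, shift_weights):
    """Lightweight modulation (assumed form): per-feature scale and shift.

    `scale_weights` and `shift_weights` stand in for the few learned
    parameters of a modulation head; they are hypothetical.
    """
    scale = [1.0 + math.tanh(w * g) for w, g in zip(scale_weights, context)]
    shift = [w * g for w, g in zip(shift_weights, context)]
    return [
        [[x * s + b for x, s, b in zip(frame, scale, shift)] for frame in chunk]
        for chunk in chunks
    ]

# Usage: six 2-dimensional frames, chunks of 4 frames with 50% overlap.
frames = [[float(t), 1.0] for t in range(6)]
chunks = split_into_chunks(frames, chunk_size=4, hop=2)
context = global_pool(chunks)
modulated = modulate(chunks, context, scale_weights=[0.0, 0.0], shift_weights=[1.0, 0.0])
```

In this sketch the pooling step has no learned parameters, and the modulation step has only two weights per feature, which mirrors the abstract's point that global context can be injected cheaply while the transformer layers handle local modelling within each chunk.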
