SPGM: Prioritizing Local Features for enhanced speech separation performance

Jia Qi Yip; Shengkui Zhao; Yukun Ma; Chongjia Ni; Chong Zhang; Hao Wang; Trung Hieu Nguyen; Kun Zhou; Dianwen Ng; Eng Siong Chng; Bin Ma

TL;DR

This work targets efficient speech separation by prioritizing local feature modeling. The authors replace the inter-blocks of a dual-path model with a Single-Path Global Modulation (SPGM) block, which consists of a parameter-free global pooling module followed by a lightweight modulation module. Every transformer layer can then be dedicated to local intra-chunk processing, which makes the model single-path. SPGM reaches 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, matching recent state-of-the-art models with up to 8 times fewer parameters, while the modulation module accounts for only about 2% of the model's total parameters. The design retains global context through pooling and modulation while spending its transformer capacity on local modeling.

Abstract

Dual-path is a popular architecture for speech separation models (e.g., Sepformer). It splits long sequences into overlapping chunks and processes them with intra-blocks, which model local features within each chunk, and inter-blocks, which model global relationships across chunks. However, it has been found that inter-blocks, which comprise half of a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure, which consists of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively, and matches the performance of recent SOTA models with up to 8 times fewer parameters. Model and weights are available at huggingface.co/yipjiaqi/spgm
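The chunk-pool-modulate idea in the abstract can be illustrated with a minimal single-channel sketch in plain Python. It is not the authors' implementation. The function names, the zero-padding of the tail, the choice of mean pooling, and the scale-and-shift form of `modulate` are all assumptions made for illustration, and the paper's own modulation equation is not reproduced on this page.

```python
from typing import List


def split_into_chunks(seq: List[float], chunk_size: int, hop: int) -> List[List[float]]:
    """Split a 1-D sequence into overlapping chunks, zero-padding the tail (assumed)."""
    if chunk_size <= 0 or hop <= 0:
        raise ValueError("chunk_size and hop must be positive")
    # Ceiling division: number of hops needed after the first chunk to cover seq.
    n_hops = max(0, -(-(len(seq) - chunk_size) // hop))
    padded = list(seq) + [0.0] * (n_hops * hop + chunk_size - len(seq))
    return [padded[i * hop : i * hop + chunk_size] for i in range(n_hops + 1)]


def mean_pool(chunks: List[List[float]]) -> List[float]:
    """Parameter-free pooling: reduce each chunk to a single global value."""
    return [sum(chunk) / len(chunk) for chunk in chunks]


def modulate(
    chunks: List[List[float]],
    global_vec: List[float],
    scale: float = 1.0,
    shift: float = 0.0,
) -> List[List[float]]:
    """Hypothetical scale-and-shift modulation; NOT the paper's equation.

    `scale` and `shift` stand in for the small set of learned parameters.
    """
    return [
        [x * (scale * g) + shift for x in chunk]
        for chunk, g in zip(chunks, global_vec)
    ]


# Ten time steps, chunk size 4, hop 2 (50% overlap) -> four chunks.
chunks = split_into_chunks([float(v) for v in range(1, 11)], chunk_size=4, hop=2)
global_vec = mean_pool(chunks)            # one global value per chunk
modulated = modulate(chunks, global_vec)  # local features adjusted by global context
```

In this sketch the only learnable quantities are `scale` and `shift`, which mirrors the abstract's point that global information enters through a parameter-free pooling step and a small modulation module rather than through full inter-chunk transformer layers.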

Paper Structure (14 sections, 1 equation, 4 figures, 2 tables)


Introduction
Methodology
Model Architecture
Single Path Global Modulation
Pooling Methods
Experiments
Datasets
Model Configuration
Training Parameters
Results
Effectiveness of the SPGM block
Comparison with Recent Models
Conclusion
Acknowledgements

Figures (4)

Figure 1: Overview of the proposed SPGM model. The SPGM block (yellow box) replaces an inter-block and represents our key contribution.
Figure 2: The SPGM block consists of the global pooling module (orange) and the modulation module (blue). K is the number of time steps, S is the number of chunks and N is the embedding size. Refer to Equation 1 for the detailed implementation of the modulation module.
Figure 3: Illustration of the change in dimensions across the chunk and inter pooling process in the global pooling module.
Figure 4: Illustration of the last element selection (LE) pooling method using a chunk size of 4 with a 50% overlap on a single channel. LE selects the last element of each chunk as the global vector while the remaining features are not used to derive the global embedding.
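
The last element selection (LE) pooling described in the Figure 4 caption can be sketched in a few lines of Python for a single channel. The function name `last_element_pool` is hypothetical, and the sketch assumes that only complete chunks are considered (no padding), which the caption does not specify.

```python
from typing import List, Sequence


def last_element_pool(
    seq: Sequence[float], chunk_size: int = 4, overlap: float = 0.5
) -> List[float]:
    """Return the last element of each overlapping chunk as the global vector.

    Assumption: trailing samples that do not fill a complete chunk are ignored.
    """
    hop = int(chunk_size * (1 - overlap))
    if hop < 1:
        raise ValueError("overlap is too large for this chunk size")
    starts = range(0, len(seq) - chunk_size + 1, hop)
    # Only the final position of each chunk contributes to the global embedding.
    return [seq[start + chunk_size - 1] for start in starts]


# Chunk size 4 with 50% overlap, as in the figure: chunks start at 0, 2, 4, 6.
pooled = last_element_pool(list(range(10)))
```

With ten input steps, the chunks cover positions 0-3, 2-5, 4-7 and 6-9, so `pooled` holds the values at positions 3, 5, 7 and 9, and the other features are not used to derive the global embedding.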
