TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

TL;DR

TF-Locoformer targets state-of-the-art (SoTA) speech separation in the time-frequency (TF) domain without recurrent networks. It is a Transformer-based architecture that pairs global self-attention with local convolutional processing: ConvSwiGLU feed-forward networks (FFNs) are placed before and after multi-head self-attention (MHSA), and a new normalization, RMSGroupNorm, is introduced for TF-domain dual-path models. Across the WSJ0-2mix, Libri2Mix, WHAMR!, and DNS benchmarks it matches or surpasses SoTA results, especially under reverberant conditions, and ablation studies confirm the importance of local modeling and of the normalization choice. The work shows that TF-domain models can deliver robust, scalable speech separation and enhancement without RNNs, with potential extensions to music and general sound separation.
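The summary names RMSGroupNorm but does not define it here. As a rough illustration only, the sketch below assumes it means splitting each feature vector into equal contiguous groups and dividing every group by its own root mean square. The function name `rms_group_norm`, the `num_groups` and `eps` parameters, and the omission of a learnable gain are all assumptions made for this example, not details taken from the paper.

```python
import math


def rms_group_norm(x, num_groups, eps=1e-5):
    """Scale each group of features by that group's root mean square.

    x is a single feature vector (list of floats). Assumptions for this
    sketch: groups are contiguous, equal-sized chunks of the vector, and
    the learnable per-feature gain is left out.
    """
    if len(x) % num_groups != 0:
        raise ValueError("feature size must be divisible by num_groups")
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        chunk = x[g * size:(g + 1) * size]
        # Root mean square of this group only (no mean subtraction).
        rms = math.sqrt(sum(v * v for v in chunk) / size + eps)
        out.extend(v / rms for v in chunk)
    return out


# Two groups with very different scales end up on a comparable scale.
normed = rms_group_norm([3.0, 4.0, 0.5, -0.5], num_groups=2)
```

Because each group is scaled independently, a large-magnitude group does not dominate the statistics of a small-magnitude one, which is the usual motivation for group-wise normalization.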

Abstract

Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.
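To make the block layout in the abstract concrete, here is a minimal pure-Python sketch of a convolutional gated FFN applied before and after self-attention. It is an illustration under stated assumptions, not the authors' implementation: the Swish-gated (SwiGLU-style) form follows the "ConvSwiGLU" name in the TL;DR, while the kernel size, hidden width, use of ordinary zero-padded convolutions for both projections, plain residual connections, the projection-free single-head attention, and the omission of normalization are simplifications chosen for this example. All function names are hypothetical.

```python
import math
import random


def swish(v):
    return v / (1.0 + math.exp(-v))


def conv1d_same(seq, weight):
    """Zero-padded 1D convolution along the sequence axis.

    seq: list of vectors; weight[k][o][i] for tap k, output o, input i.
    """
    taps, c_out, half = len(weight), len(weight[0]), len(weight) // 2
    out = []
    for t in range(len(seq)):
        y = [0.0] * c_out
        for k in range(taps):
            s = t + k - half
            if 0 <= s < len(seq):  # positions outside the sequence count as zeros
                for o in range(c_out):
                    y[o] += sum(w * v for w, v in zip(weight[k][o], seq[s]))
        out.append(y)
    return out


def conv_swiglu(seq, w_in, w_out):
    """Conv projection to 2*hidden channels, Swish gating, conv projection back."""
    hidden = len(w_in[0]) // 2
    h = conv1d_same(seq, w_in)
    # Assumption: the first half of the channels gates the second half.
    gated = [[swish(r[j]) * r[hidden + j] for j in range(hidden)] for r in h]
    return conv1d_same(gated, w_out)


def self_attention(seq):
    """Simplified single-head attention with no learned projections."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        out.append([sum(e[j] / z * seq[j][i] for j in range(len(seq)))
                    for i in range(d)])
    return out


def residual(seq, update):
    return [[a + b for a, b in zip(x, u)] for x, u in zip(seq, update)]


def locoformer_like_block(seq, ffn_pre, ffn_post):
    """Local FFN, then global attention, then local FFN, each with a residual."""
    seq = residual(seq, conv_swiglu(seq, *ffn_pre))
    seq = residual(seq, self_attention(seq))
    return residual(seq, conv_swiglu(seq, *ffn_post))


def init_conv(rng, taps, c_out, c_in):
    return [[[rng.uniform(-0.1, 0.1) for _ in range(c_in)]
             for _ in range(c_out)] for _ in range(taps)]


rng = random.Random(0)
D, HID, K, L = 4, 8, 3, 6  # toy sizes, chosen only for the demo
ffn_pre = (init_conv(rng, K, 2 * HID, D), init_conv(rng, K, D, HID))
ffn_post = (init_conv(rng, K, 2 * HID, D), init_conv(rng, K, D, HID))
frames = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(L)]
out = locoformer_like_block(frames, ffn_pre, ffn_post)
```

The point of the layout is the division of labor: each convolutional FFN only sees a few neighboring positions, while the attention step in the middle mixes information across the whole sequence.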

Paper Structure

This paper contains 11 sections, 4 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Overview of the proposed TF-Locoformer. The temporal modeling block is the same as the frequency modeling block with a permutation of the time and frequency dimensions.
  • Figure 2: Box plots of SI-SNRi [dB] for models of different sizes on the WSJ0-2mix test set. The numbers below each model size indicate the mean and standard deviation of SI-SNRi.
  • Figure 3: Average SI-SNRi with different kernel sizes on the WSJ0-2mix test set. Results are shown for the Medium model.
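The Figure 1 caption says the temporal modeling block is the frequency modeling block applied after permuting the time and frequency dimensions. A minimal sketch of that dual-path idea follows. The helper names, the frequency-then-time order, and the use of scalar features with a toy running-sum "block" are assumptions for illustration; in the actual model each TF bin would hold a feature vector and the blocks would be Transformer-style layers.

```python
def transpose(grid):
    """Swap the two leading axes of a nested list: [A][B] becomes [B][A]."""
    return [list(col) for col in zip(*grid)]


def apply_along_inner_axis(grid, block):
    """Run `block` independently on every inner sequence grid[a]."""
    return [block(seq) for seq in grid]


def dual_path_step(spec, block_freq, block_time):
    """One dual-path pass over spec[t][f] (frame t, frequency bin f)."""
    spec = apply_along_inner_axis(spec, block_freq)  # sequences over frequency
    spec = transpose(spec)                           # now indexed [f][t]
    spec = apply_along_inner_axis(spec, block_time)  # sequences over time
    return transpose(spec)                           # back to [t][f]


def running_sum(seq):
    """Toy stand-in for a sequence-modeling block."""
    total, out = 0, []
    for v in seq:
        total += v
        out.append(total)
    return out


spec = [[1, 2], [3, 4], [5, 6]]  # 3 frames, 2 frequency bins
result = dual_path_step(spec, running_sum, running_sum)
# result == [[1, 3], [4, 10], [9, 21]]
```

The same callable is passed for both paths here to mirror the caption: only the axis it runs along changes, not the block itself.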