Resource-Efficient Separation Transformer

Luca Della Libera, Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin

TL;DR

This paper addresses the high computational cost of Transformer-based speech separation by introducing the Resource-Efficient Separation Transformer (RE-SepFormer), which processes non-overlapping latent chunks and uses a Memory Transformer operating on chunk summaries to capture long-range dependencies. The approach reduces the parameter count by about 3x and MACs by about 11x relative to SepFormer while maintaining competitive separation performance: it achieves an SDRi near 19 dB on WSJ0-2Mix and strong results on WHAM! in both causal and non-causal modes. Empirical results show that RE-SepFormer scales better in memory and inference time, which brings substantial gains for long mixtures and for on-device, real-time applications, and its feed-forward-centric architecture is highly parallelizable. The work positions RE-SepFormer as a practical, efficient alternative for on-device speech separation without a major loss in performance, enabling deployment on GPU-enabled mobile devices and similar platforms.

Abstract

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require many learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.

Paper Structure (16 sections, 2 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: A high-level description of the masking-based source separation pipeline: the encoder learns a latent representation $h$ from the input mixture $x$. The masking network then estimates the optimal masks $m_1$ and $m_2$ to separate the sources in the mixture. Finally, the decoder reconstructs the sources from the masked representations.
  • Figure 2: (Top) The architecture of the masking network. (Bottom) The Resource-Efficient SepFormer module: (1) the latent representation $h$ is chunked to get $h_0'$, $h_1'$, $\dots$, $h_{N_c}'$; (2) the IntraTransformer is applied to all of the chunks independently; (3) the output is averaged over the time dimension and passed through the Memory Transformer; (4) the resulting vector is added to the output of the IntraTransformer with broadcasting over the time axis; (5) the resulting tensor is passed through another IntraTransformer to obtain the final output $h''$.
  • Figure 3: Memory in GB (left panel) and inference time in seconds (right panel) comparison of RE-SepFormer, SkiM and SepFormer-Light. The x-axis in both panels shows the length of the input signal in seconds (8 kHz sampling rate).
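
The five steps in the Figure 2 caption can be sketched as a data-flow example in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`chunk`, `re_sepformer_block`, `toy_attention`) are hypothetical, the zero-padding of the last chunk and the trimming of the output are assumptions, and `toy_attention` is a weight-free stand-in for the real IntraTransformer and Memory Transformer layers.

```python
import numpy as np

def chunk(h, chunk_size):
    """Split h of shape (T, F) into non-overlapping chunks of shape (N_c, C, F).

    Assumption: the tail is zero-padded so that T becomes a multiple of C.
    """
    T, _ = h.shape
    pad = (-T) % chunk_size
    h = np.pad(h, ((0, pad), (0, 0)))
    return h.reshape(-1, chunk_size, h.shape[1])

def toy_attention(x):
    """Hypothetical stand-in for a Transformer layer: weight-free
    self-attention over the first axis, plus a residual connection."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ x

def re_sepformer_block(h, chunk_size, intra1, memory, intra2):
    """Follow steps (1)-(5) of the Figure 2 caption for h of shape (T, F)."""
    T, F = h.shape
    chunks = chunk(h, chunk_size)                # (1) chunking -> (N_c, C, F)
    z = np.stack([intra1(c) for c in chunks])    # (2) IntraTransformer, per chunk
    summaries = z.mean(axis=1)                   # (3) average over time -> (N_c, F)
    mem = memory(summaries)                      # (3) attention across chunk summaries
    z = z + mem[:, None, :]                      # (4) broadcast add over the time axis
    out = np.stack([intra2(c) for c in z])       # (5) second IntraTransformer
    return out.reshape(-1, F)[:T]                # assumption: trim the padding again

rng = np.random.default_rng(0)
h = rng.standard_normal((10, 4))                 # T=10 frames, F=4 features
h_out = re_sepformer_block(h, 4, toy_attention, toy_attention, toy_attention)
print(h_out.shape)
```

In this sketch the only operation that mixes information across chunks is `memory`, and it sees one summary vector per chunk rather than every frame, so its input length is the number of chunks instead of the sequence length.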