xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement

Nikolai Lund Kühne, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan

TL;DR

This work addresses single-channel speech enhancement (SE) by replacing attention-based blocks with xLSTM blocks, which scale linearly with input sequence length. The authors introduce xLSTM-SENet, an encoder-decoder SE system that jointly denoises magnitude and wrapped phase by integrating TF-xLSTM blocks into the MP-SENet framework. The xLSTM blocks rely on exponential gating, which lets the model revise earlier storage decisions, and on matrix memory, which increases storage capacity. Through ablations and comparisons, the study shows that xLSTM-based models, and even conventional LSTMs under certain configurations, can match or outperform state-of-the-art Mamba- and Conformer-based systems on VoiceBank+DEMAND, with xLSTM-SENet2 delivering the best performance among systems of similar complexity. The results suggest that scalable, memory-rich recurrent architectures, properly gated and bidirectional, are competitive for SE and can outperform complex attention-based models at similar complexity, with practical implications for real-time and low-resource hearing aid applications. The work also provides code to facilitate replication and further exploration of xLSTM-based SE.
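
To make the "magnitude and wrapped phase" split concrete, here is a minimal NumPy sketch that decomposes a complex spectrogram into the two quantities the system denoises in parallel, then recombines them. The STFT settings (`n_fft=400`, `hop=100`), the function names, and the random test signal are illustrative assumptions rather than values stated in this summary, and the denoising networks themselves are omitted.

```python
import numpy as np

def stft(signal, n_fft=400, hop=100):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def split_mag_phase(spec):
    """Split a complex spectrogram into magnitude and wrapped phase."""
    magnitude = np.abs(spec)
    wrapped_phase = np.angle(spec)  # wrapped to the interval [-pi, pi]
    return magnitude, wrapped_phase

def merge_mag_phase(magnitude, wrapped_phase):
    """Recombine magnitude and phase into a complex spectrogram."""
    return magnitude * np.exp(1j * wrapped_phase)

# Hypothetical input: one second of white noise at 16 kHz.
rng = np.random.default_rng(0)
noisy = rng.standard_normal(16000)

spec = stft(noisy)
mag, phase = split_mag_phase(spec)
rebuilt = merge_mag_phase(mag, phase)
```

In an enhancement system, `mag` and `phase` would each pass through a denoising branch before being merged; here the round trip only confirms that the decomposition is lossless.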

Abstract

While attention-based architectures, such as Conformers, excel in speech enhancement, they face challenges such as scalability with respect to input sequence length. In contrast, the recently proposed Extended Long Short-Term Memory (xLSTM) architecture offers linear scalability. However, xLSTM-based models remain unexplored for speech enhancement. This paper introduces xLSTM-SENet, the first xLSTM-based single-channel speech enhancement system. A comparative analysis reveals that xLSTM, and notably even LSTM, can match or outperform state-of-the-art Mamba- and Conformer-based systems across various model sizes in speech enhancement on the VoiceBank+DEMAND dataset. Through ablation studies, we identify key architectural design choices such as exponential gating and bidirectionality contributing to its effectiveness. Our best xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and Conformer-based systems of similar complexity on the VoiceBank+DEMAND dataset.
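
The two design choices highlighted by the ablations, exponential gating and bidirectionality, can be illustrated with a minimal single-head recurrence in the style of the mLSTM cell from the xLSTM architecture. This is a sketch under stated assumptions, not the authors' implementation: the function names, the sigmoid forget gate, the omission of the output gate and of all learned projections, and the concatenation of forward and backward outputs are choices made here for illustration. Each scan visits every time step once, so the cost grows linearly with sequence length.

```python
import numpy as np

def mlstm_scan(q, k, v, i_pre, f_pre):
    """Causal mLSTM-style recurrence with a matrix memory.

    q, k, v: (T, d) query/key/value sequences (assumed already projected).
    i_pre, f_pre: (T,) input- and forget-gate pre-activations.
    Returns hidden states of shape (T, d).
    """
    T, d = q.shape
    C = np.zeros((d, d))  # matrix memory
    n = np.zeros(d)       # normalizer state
    m = 0.0               # running log-scale stabilizer
    out = np.empty((T, d))
    for t in range(T):
        # Forget gate: sigmoid, kept in log space (log sigmoid(f_pre)).
        log_f = -np.logaddexp(0.0, -f_pre[t])
        # Input gate: exponential, exp(i_pre). The stabilizer m keeps
        # both rescaled gates at or below 1 so exp() cannot overflow.
        m_new = max(log_f + m, i_pre[t])
        i_gate = np.exp(i_pre[t] - m_new)
        f_gate = np.exp(log_f + m - m_new)
        k_t = k[t] / np.sqrt(d)
        C = f_gate * C + i_gate * np.outer(v[t], k_t)  # write value at key
        n = f_gate * n + i_gate * k_t
        out[t] = C @ q[t] / max(abs(n @ q[t]), 1.0)    # normalized read
        m = m_new
    return out

def bidirectional_mlstm(q, k, v, i_pre, f_pre):
    """Run the scan forward and backward in time, then concatenate features."""
    fwd = mlstm_scan(q, k, v, i_pre, f_pre)
    bwd = mlstm_scan(q[::-1], k[::-1], v[::-1], i_pre[::-1], f_pre[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

# Hypothetical random inputs, for shape checking only.
rng = np.random.default_rng(0)
T, d = 50, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
i_pre = rng.standard_normal(T)
f_pre = rng.standard_normal(T)

hidden = bidirectional_mlstm(q, k, v, i_pre, f_pre)
print(hidden.shape)  # (50, 16)
```

The backward pass gives every frame access to future context, which is why bidirectionality helps offline enhancement but would need to be dropped or bounded for strictly causal use.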

Paper Structure

This paper contains 16 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overall structure of our proposed xLSTM-SENet with parallel magnitude and phase spectra denoising.
  • Figure 2: Scaling results on the VoiceBank+DEMAND dataset. The smallest ($N=1$) and largest ($N=6$) models have 1.37M and 2.94M parameters, respectively.