SEMamba++: A General Speech Restoration Framework Leveraging Global, Local, and Periodic Spectral Patterns

Yongjoon Lee, Jung-Woo Choi

Abstract

General speech restoration demands techniques that can interpret complex speech structures under various distortions. While State-Space Models like SEMamba have advanced the state of the art in speech denoising, they are not inherently optimized for critical speech characteristics, such as spectral periodicity or multi-resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech-specific features as inductive biases. In particular, we propose Frequency GLP, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. We then design a multi-resolution parallel time-frequency dual-processing block to capture diverse spectral patterns, and a learnable mapping to further enhance model performance. With all our ideas combined, the proposed SEMamba++ achieves the best performance among multiple baseline models while remaining computationally efficient.

Paper Structure (24 sections, 4 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Various TFDP processing methods. ds and us denote downsampling and upsampling, respectively.
  • Figure 2: Overall architecture of SEMamba++ (a) and Frequency GLP (b). In (a), "Mag.", "Comp.", and "Decomp." refer to magnitude, compression, and decompression, respectively. "Conv" and "TrConv" indicate the 2D convolution and transposed convolution that down- and upsample along the frequency axis. (b) describes Frequency GLP with frequency dimension $F'$.
  • Figure 3: Gradient visualization of outputs from different branches in the proposed multi-branch method. (a) and (b) represent the magnitude spectrograms of the clean and the degraded speech, respectively. Resolution-wise visualizations of the gradient-weighted magnitude spectrogram are illustrated in (1), (2), and (3). The top resolution has a frequency dimension equal to $F'$.
  • Figure 4: The ratio of gradient norms under different degradation types and intensities. $\mathcal{R} > 1$ indicates a larger contribution from the Global Periodicity module than from the Local module.
  • Figure 5: Softplus functions with the learnable $\beta$ for each frequency band. The black dotted line denotes the ReLU function.
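
The Figure 5 caption compares Softplus curves with a per-band $\beta$ against ReLU. As a minimal sketch of that relationship, the snippet below uses the standard Softplus definition $\mathrm{softplus}_\beta(x) = \frac{1}{\beta}\log(1 + e^{\beta x})$, which approaches ReLU as $\beta$ grows. The function names, the one-value-per-band layout, and the $\beta$ values are illustrative assumptions. In the paper $\beta$ is learned, and this listing does not state how it is parameterized or where the activation is applied.

```python
import math


def softplus(x: float, beta: float) -> float:
    """Standard Softplus with sharpness beta: log(1 + exp(beta * x)) / beta."""
    z = beta * x
    # Numerically stable form of log(1 + exp(z)): max(z, 0) + log1p(exp(-|z|)).
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta


def relu(x: float) -> float:
    return max(x, 0.0)


def bandwise_softplus(frame, betas):
    """Apply Softplus to each frequency band using that band's own beta.

    Assumed (hypothetical) layout: frame[i] is the value in band i and
    betas[i] is the sharpness for band i.
    """
    if len(frame) != len(betas):
        raise ValueError("frame and betas must have one entry per band")
    return [softplus(x, b) for x, b in zip(frame, betas)]


# Illustrative values only; a trained model would learn the betas.
betas = [0.5, 1.0, 5.0, 50.0]
frame = [-1.0, -0.1, 0.1, 1.0]
activated = bandwise_softplus(frame, betas)

# Gap to ReLU at x = 0 is log(2) / beta, so it shrinks as beta grows.
gaps_at_zero = [softplus(0.0, b) - relu(0.0) for b in betas]

for b, gap in zip(betas, gaps_at_zero):
    print(f"beta={b:5.1f}  softplus(0) - relu(0) = {gap:.4f}")
```

A small $\beta$ gives a smooth curve that stays well above zero for negative inputs, while a large $\beta$ is nearly indistinguishable from ReLU. A per-band $\beta$ therefore lets each frequency band choose its own point between the two.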