A Speech Enhancement Method Using Fast Fourier Transform and Convolutional Autoencoder

Pu-Yun Kow; Pu-Zhao Kow

TL;DR

The paper treats speech enhancement as an inverse problem under strict real-time constraints and proposes FFT-ConvAE, a lightweight method that uses discrete Fourier transform magnitudes and a Convolutional Autoencoder (ConvAE) to learn spectral gains $A_j$. The enhanced spectrum is formed as $\hat{x}_{j}^{\rm approx} = A_j \hat{x}_{j}^{\rm deg} / |\hat{x}_{j}^{\rm deg}|$, so the phase of the degraded signal is kept and only its magnitude is replaced. With a ConvAE trained using linear activations, the approach runs with a real-time factor $\mathrm{RTF} \le 1$ on 16 kHz speech. Evaluated on the Helsinki Speech Challenge 2024 data, it secures second place in Task 1, demonstrating the viability of neural-network-free, magnitude-domain strategies for reconstruction in less ill-posed settings, while revealing limitations for tasks with strong reverberation. The work highlights that simple spectral schemes can be competitive under practical constraints, offering lightweight alternatives to deep learning in real-time speech processing.
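The reconstruction formula in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dft`, `idft`, and `apply_gains` and the `eps` guard are assumptions made for illustration, and a naive $O(N^2)$ transform stands in for an FFT library so that the example has no dependencies. The values $A_j$ would come from the trained ConvAE; here they are simply passed in as a list.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def apply_gains(x_deg, gains, eps=1e-12):
    """Build x_approx whose spectrum is A_j * x_hat_deg_j / |x_hat_deg_j|.

    x_deg : degraded time-domain samples (real numbers)
    gains : one non-negative value A_j per frequency bin
    eps   : hypothetical guard against division by zero (not from the paper)
    """
    spectrum_deg = dft(x_deg)
    spectrum_approx = []
    for a_j, x_j in zip(gains, spectrum_deg):
        magnitude = abs(x_j)
        # Keep the phase of the degraded bin; replace its magnitude by A_j.
        spectrum_approx.append(a_j * x_j / magnitude if magnitude > eps else 0j)
    # For conjugate-symmetric gains the result is real up to rounding error.
    return [value.real for value in idft(spectrum_approx)]


# Toy check: the degradation only halves the amplitude, so the phase is intact.
clean = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
degraded = [0.5 * v for v in clean]
# Oracle magnitudes of the clean signal stand in for the ConvAE output.
oracle_gains = [abs(v) for v in dft(clean)]
restored = apply_gains(degraded, oracle_gains)
```

In this toy case the degraded phase equals the clean phase, so supplying the clean magnitudes recovers the clean signal up to rounding error. When the degradation also distorts the phase, as with strong reverberation, a magnitude-only correction of this kind cannot undo it, which is consistent with the limitation noted in the summary.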

Abstract

This paper addresses the reconstruction of audio signals from degraded measurements. We propose a lightweight model that combines the discrete Fourier transform with a Convolutional Autoencoder (FFT-ConvAE), which enabled our team to achieve second place in the Helsinki Speech Challenge 2024. Our results, together with those of other teams, demonstrate the potential of neural-network-free approaches for effective speech signal reconstruction.

Paper Structure

This paper contains 9 sections, 21 equations, 8 figures, and 2 tables.

Introduction
Speech enhancement as an inverse problem
Methodology
Overview of the HSC2024
Parameter Setting and Training
Results
Discussions
Conclusions
Stability and Instability Mechanisms in Inverse Problems

Figures (8)

Figure 3.1: Model architecture of the FFT-ConvAE Model
Figure 5.1: The performance of training
Figure 5.2: Task 1 Level 1: Blue shows the Fourier magnitudes of the clean signal. Red indicates (i) the filtered signal and (ii) the trained signal
Figure 5.3: Samples #16 and #516 in Task 1 Level 4: Blue shows the Fourier magnitude of the clean signal. Red indicates (i) the Fourier magnitude of the filtered signal and (ii) the Fourier magnitude of the trained signal.
Figure 5.4: Spectrogram, text transcribed by evaluate.py, and CER of (a) Sample #11 and (b) Sample #101 in Task 1 Level 1
...and 3 more figures

Theorems & Definitions (1)

Remark 2.1
