A Speech Enhancement Method Using Fast Fourier Transform and Convolutional Autoencoder
Pu-Yun Kow, Pu-Zhao Kow
TL;DR
The paper treats speech enhancement as an inverse problem under strict real-time constraints and proposes FFT-ConvAE, a lightweight method that uses discrete Fourier transform magnitudes and a Convolutional Autoencoder (ConvAE) to learn spectral gains $A_j$. The enhanced spectrum is formed as $\hat{x}_{j}^{\rm approx} = A_j \hat{x}_{j}^{\rm deg} / |\hat{x}_{j}^{\rm deg}|$, so that, for real non-negative $A_j$, the gain sets the magnitude of each coefficient while the phase of the degraded coefficient is retained. With a ConvAE trained using linear activations, the method runs with a real-time factor $\mathrm{RTF} \le 1$ on 16 kHz speech. Evaluated on the Helsinki Speech Challenge 2024 data, it takes second place in Task 1, which shows that neural-network-free, magnitude-domain strategies are viable for reconstruction in less ill-posed settings; it also reveals limitations on tasks with strong reverberation. The work indicates that simple spectral schemes can be competitive under practical constraints and offer lightweight alternatives to deep learning in real-time speech processing.
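The gain formula and the real-time factor can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `apply_spectral_gains` and `real_time_factor` are hypothetical, the transform is applied to the whole signal rather than to frames, and the gains are passed in as an array, whereas in the paper they would come from the trained ConvAE. The RTF is assumed here to be processing time divided by audio duration.

```python
import numpy as np


def apply_spectral_gains(x_deg, gains, eps=1e-12):
    """Set the DFT magnitudes of x_deg to `gains`, keeping the degraded phase.

    Implements x_approx_j = A_j * x_deg_j / |x_deg_j| per frequency bin.
    `gains` must have len(x_deg) // 2 + 1 entries (one per rfft bin).
    """
    X_deg = np.fft.rfft(x_deg)
    # Unit-modulus phase factor; eps guards against division by zero.
    phase = X_deg / np.maximum(np.abs(X_deg), eps)
    X_approx = gains * phase
    return np.fft.irfft(X_approx, n=len(x_deg))


def real_time_factor(processing_seconds, n_samples, sample_rate=16000):
    """Processing time divided by audio duration (assumed RTF definition)."""
    return processing_seconds / (n_samples / sample_rate)


# Illustration only: a toy "degradation" that halves the amplitude, with
# oracle magnitudes standing in for the gains a trained model would predict.
rng = np.random.default_rng(0)
x_clean = rng.standard_normal(16000)      # 1 s of synthetic audio at 16 kHz
x_deg = 0.5 * x_clean
oracle_gains = np.abs(np.fft.rfft(x_clean))
x_approx = apply_spectral_gains(x_deg, oracle_gains)
```

Because the toy degradation leaves the phase untouched, restoring the magnitudes recovers the clean signal up to floating-point error. Degradations that distort the phase, such as strong reverberation, are not corrected by a magnitude-only scheme of this kind.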
Abstract
This paper addresses the reconstruction of audio signals from degraded measurements. We propose a lightweight model that combines the discrete Fourier transform with a Convolutional Autoencoder (FFT-ConvAE), which enabled our team to achieve second place in the Helsinki Speech Challenge 2024. Our results, together with those of other teams, demonstrate the potential of neural-network-free approaches for effective speech signal reconstruction.
