A Speech Enhancement Method Using Fast Fourier Transform and Convolutional Autoencoder

Pu-Yun Kow, Pu-Zhao Kow

TL;DR

The paper treats speech enhancement as an inverse problem under strict real-time constraints and proposes FFT-ConvAE, a lightweight method that uses discrete Fourier transform magnitudes and a Convolutional Autoencoder to learn spectral gains $A_j$. The reconstruction is formulated as $\hat{x}_{j}^{\rm approx} = A_j \hat{x}_{j}^{\rm deg} / |\hat{x}_{j}^{\rm deg}|$, so the estimate keeps the phase of the degraded coefficient $\hat{x}_{j}^{\rm deg}$ and replaces only its magnitude. The ConvAE is trained with linear activations, and the method runs with a real-time factor $\mathrm{RTF} \le 1$ on 16 kHz speech. Evaluated on the Helsinki Speech Challenge 2024 data, it secures second place in Task 1, demonstrating the viability of neural-network-free, magnitude-domain strategies for reconstruction in less ill-posed settings, while revealing limitations for tasks with strong reverberation. The work highlights that simple spectral schemes can be competitive under practical constraints, offering lightweight alternatives to deep learning in real-time speech processing.
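The reconstruction formula above can be illustrated with a short NumPy sketch. It is not the authors' implementation: it assumes a single FFT over the whole signal, takes the magnitudes $A_j$ as a given array (in the paper they come from the trained ConvAE), and uses the common definition of the real-time factor as processing time divided by audio duration. The function names and the `eps` guard are hypothetical.

```python
import numpy as np

def reconstruct_from_magnitudes(x_deg, magnitudes, eps=1e-12):
    """Apply x_approx_j = A_j * x_deg_j / |x_deg_j| in the Fourier domain.

    x_deg:      degraded time-domain signal (1-D real array)
    magnitudes: estimated magnitudes A_j, one per rfft bin (assumed given)
    eps:        hypothetical guard against division by zero
    """
    X_deg = np.fft.rfft(x_deg)
    # Unit-modulus factor carrying only the phase of the degraded signal.
    phase = X_deg / np.maximum(np.abs(X_deg), eps)
    X_approx = magnitudes * phase
    return np.fft.irfft(X_approx, n=len(x_deg))

def real_time_factor(processing_seconds, num_samples, sample_rate=16000):
    """Processing time divided by audio duration (assumed RTF definition)."""
    return processing_seconds / (num_samples / sample_rate)
```

As a sanity check, if the degraded signal is simply an attenuated copy of the clean one, its phase is unchanged, so supplying the clean magnitudes recovers the clean signal exactly. Degradations that distort phase, such as strong reverberation, are not corrected by this step.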

Abstract

This paper addresses the reconstruction of audio signals from degraded measurements. We propose a lightweight model that combines the discrete Fourier transform with a Convolutional Autoencoder (FFT-ConvAE), which enabled our team to achieve second place in the Helsinki Speech Challenge 2024. Our results, together with those of other teams, demonstrate the potential of neural-network-free approaches for effective speech signal reconstruction.

Paper Structure

This paper contains 9 sections, 21 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 3.1: Model architecture of the FFT-ConvAE Model
  • Figure 5.1: The performance of training
  • Figure 5.2: Task 1 Level 1: Blue shows the Fourier magnitudes of the clean signal. Red indicates (i) the filtered signal and (ii) the trained signal
  • Figure 5.3: Samples #16 and #516 in Task 1 Level 4: Blue shows the Fourier magnitude of the clean signal. Red indicates (i) the Fourier magnitude of the filtered signal and (ii) the Fourier magnitude of the trained signal.
  • Figure 5.4: Spectrogram, texts transcribed by evaluate.py and CER of (a) Sample # 11 and (b) Sample # 101 in Task 1 Level 1
  • ...and 3 more figures

Theorems & Definitions (1)

  • Remark 2.1