Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance

Manuel Milling, Shuo Liu, Andreas Triantafyllopoulos, Ilhan Aslan, Björn W. Schuller

TL;DR

The paper tackles the robustness gap of neural audio systems under real-world noise by proposing an end-to-end framework that jointly optimises an audio enhancement (AE) front-end and downstream computer audition (CA) models. It introduces an iterative training paradigm that uses sample-wise importance, derived from downstream task losses, to focus AE training on harder examples and to adapt the CA model to enhanced signals. Across speech command recognition (SCR), automatic speech recognition (ASR), speech emotion recognition (SER), and acoustic scene classification (ASC), the iterative approach consistently outperforms baselines and standard data augmentation, with the largest gains at low signal-to-noise ratios (SNRs) and strong robustness across tasks. The work highlights the benefits of task-specific AE and suggests future directions, including integration with self-supervised learning to further improve performance and generalisation.

Abstract

Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination in real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications cover speech and non-speech tasks, semantic and non-semantic features, and transient and global information. The experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios (SNRs), for a wide range of computer audition tasks in everyday-life noisy environments.
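
The idea of using a sample-wise performance measure as sample importance can be illustrated with a small sketch. It is not the authors' formulation: the helper names `importance_weights` and `weighted_ae_loss`, the normalisation of weights to a mean of 1, and the use of a mean absolute error as the AE loss are all assumptions made for illustration.

```python
import numpy as np


def importance_weights(cat_losses, eps=1e-8):
    """Map per-sample downstream-task losses to weights with mean ~1.

    Assumption: importance is taken to be proportional to the downstream
    loss, so harder samples get larger weights. The paper's exact mapping
    may differ.
    """
    cat_losses = np.asarray(cat_losses, dtype=np.float64)
    return cat_losses / (cat_losses.mean() + eps)


def weighted_ae_loss(enhanced, clean, cat_losses):
    """Importance-weighted AE loss over a batch of shape (batch, time).

    Assumption: the per-sample AE loss is the mean absolute error between
    the enhanced and the clean signal.
    """
    enhanced = np.asarray(enhanced, dtype=np.float64)
    clean = np.asarray(clean, dtype=np.float64)
    per_sample = np.abs(enhanced - clean).mean(axis=1)  # one value per sample
    weights = importance_weights(cat_losses)
    return float((weights * per_sample).mean())
```

With equal downstream losses, all weights are close to 1 and the result reduces to the plain batch-mean AE loss; samples on which the downstream model performs poorly contribute more otherwise.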

Paper Structure (19 sections, 8 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Diagrams showing the methodologies used. The red arrows indicate back-propagation through the network modules with respect to the losses $L$ of the AE and the computer audition task (CAT). In a), only the CAT loss is optimised, with the AE frozen. In b), both the CAT loss and the AE loss are optimised, and the AE parameters are affected by both losses. In our suggested approach c), the parameters of the CAT and the AE are each affected only by their respective loss, and, in contrast to a) and b), the AE loss includes a sample-level importance.
  • Figure 2: Schematic diagram of the U-net architecture. The raw audio is transformed with a short-time Fourier transform (STFT) into a spectrogram, which is then fed into a fully convolutional network with an encoder and decoder and skip connections between corresponding encoder and decoder layers in the U-shaped architecture. The final reconstructed or enhanced spectrogram is then transformed back into a raw audio signal with an inverse STFT.
  • Figure 3: Schematic diagrams showing architectures for downstream computer audition tasks. The architecture for speech command recognition in a) is the only one acting on the raw audio signal; the architectures in b) to d) take 2-dimensional spectrogram representations of the audio signal as input. In a) we apply 1D convolutional and max-pooling layers, prior to a global average pooling and classification layer. The automatic speech recognition model depicted in b) consists of 2D convolutional layers and convolutional blocks with skip connections, followed by a layer normalisation and a bi-directional GRU-RNN layer prior to the classification layer. The speech emotion recognition architecture in c) only applies convolutional blocks prior to a global average pooling and a classification layer. The acoustic scene classification model in d) concatenates the outputs of different convolutional blocks and works in a fully convolutional manner.
  • Figure 4: Spectrograms for visualisation of audio enhancement samples. The left column displays the clean target samples, the middle column shows the noisy samples obtained by artificially adding noise to the clean samples, and the right column shows the reconstructed (denoised) audio. In the first row (acoustic scene classification), the clean sample is music and the noise is speech; in the second row (automatic speech recognition), the clean sample is speech and the noise originates from a construction site.
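
The noisy inputs in Figure 4 are produced by adding noise to clean samples, and results are discussed in terms of SNR. A minimal sketch of mixing a clean signal with noise at a target SNR is given below. The helper `mix_at_snr` is hypothetical, and it assumes the SNR is defined from the average power over the whole clip; the paper's actual mixing procedure may differ.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR in dB.

    Assumption: SNR = 10 * log10(P_clean / P_noise), with power averaged
    over the full signal. Both inputs must have the same shape.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if clean.shape != noise.shape:
        raise ValueError("clean and noise must have the same shape")

    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0.0:
        raise ValueError("noise signal has zero power")

    # Scale the noise so that p_clean / (scale**2 * p_noise) == 10**(snr_db / 10).
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Lower `snr_db` values give a louder noise component, which corresponds to the low-SNR conditions where the abstract reports the largest robustness gains.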