Should Audio Front-ends be Adaptive? Comparing Learnable and Adaptive Front-ends

Qiquan Zhang, Buddhi Wickramasinghe, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Haizhou Li

TL;DR

To address the limitation that existing audio front-ends are fixed at inference time, the paper proposes Ada-FE, a Gabor-based front-end with a fixed first layer and an adaptive second layer whose Q-factor is updated frame-by-frame via two modules: a level-dependent adaptation and a neural adaptive feedback controller. The method is compared against several lightweight learnable front-ends across two backbones on eight audio and speech benchmarks, including speech, sound event, and music tasks. Results show Ada-FE consistently outperforms state-of-the-art learnable front-ends and demonstrates strong robustness to test-time variations, with the simplified Ada-FE-S-FM achieving competitive or better accuracy while using fewer hand-crafted components. The findings suggest adaptive front-ends offer meaningful gains in real-world environments and open avenues for integrating adaptivity with audio representation learning and pre-training.

Abstract

Hand-crafted features, such as Mel-filterbanks, have traditionally been the choice for many audio processing applications. Recently, there has been a growing interest in learnable front-ends that extract representations directly from the raw audio waveform. However, both hand-crafted filterbanks and current learnable front-ends lead to fixed computation graphs at inference time, failing to dynamically adapt to varying acoustic environments, a key feature of human auditory systems. To this end, we explore the question of whether audio front-ends should be adaptive by comparing the Ada-FE front-end (a recently developed adaptive front-end that employs a neural adaptive feedback controller to dynamically adjust the Q-factors of its spectral decomposition filters) to established learnable front-ends. Specifically, we systematically investigate learnable front-ends and Ada-FE across two commonly used back-end backbones and a wide range of audio benchmarks including speech, sound events, and music. The comprehensive results show that our Ada-FE outperforms advanced learnable front-ends, and more importantly, it exhibits impressive stability or robustness on test samples over various training epochs.

Paper Structure

This paper contains 16 sections, 3 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Illustration of (a) fixed, (b) learnable but non-adaptive, and (c) learnable and adaptive audio front-ends. In real-world acoustic scenarios, speech and audio signals are inevitably shaped by varying acoustic conditions during transmission. The input speech level, for instance, varies with the distance between the speakers and the microphone and with the voice levels of the speakers. (a) Fixed (hand-crafted) front-ends extract features using fixed filters during training and inference regardless of the acoustic scene. (b) Existing neural front-ends learn a common model to deal with the different acoustic scenes present in the training data but cannot change to compensate for previously unseen acoustic scenarios at test time. (c) Our adaptive front-end employs a neural controller to dynamically adjust the filter response (or shape) to varying acoustic conditions during training and inference.
  • Figure 2: Illustration of audio/speech representation methods, which mainly include fixed, hand-crafted features (e.g., Mel- and Gammatone-filterbanks), lightweight learnable front-ends (e.g., TD-fbanks, LEAF, and SincNet), our dynamically adaptive front-end (Ada-FE), and pre-trained audio models (e.g., Wav2Vec, AST, WavLM, and HuBERT).
  • Figure 3: Illustrations of (a) the overall diagram of the adaptive front-end (Ada-FE) [wickramasinghe2023dnn] and (b) the simplified adaptive front-end (Ada-FE-S), where the hand-crafted level-dependent adaptation function module is removed and the adaptive Q value is completely controlled by the neural adaptive feedback controller (orange box).
  • Figure 4: Illustrations of the time-domain impulse responses and frequency responses of three Gabor filters, where $f_{c}\!=\!3000$ and $Q\!=\!\left\{1.5, 2.0, 2.5\right\}$, respectively.
  • Figure 5: Illustration of the LDA module.
  • ...and 9 more figures
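
The Figure 4 caption describes Gabor filters with a shared center frequency and three Q values. The sketch below is one way to reproduce that setup numerically. It is not the paper's implementation. Everything not stated in the caption is an assumption: that $f_c$ is in Hz, the 16 kHz sample rate, the 401-tap length, the definition of Q as center frequency divided by the half-power (-3 dB) bandwidth, and the helper names `gabor_filter` and `half_power_bandwidth`.

```python
import numpy as np


def gabor_filter(fc, q, sample_rate=16000, length=401):
    """Complex Gabor impulse response: Gaussian envelope times a complex tone.

    Assumption: Q = fc / (half-power bandwidth). The paper may parameterise
    the Gaussian width differently.
    """
    bandwidth = fc / q  # target -3 dB bandwidth in Hz
    # For a Gaussian envelope exp(-t^2 / (2 sigma_t^2)), the half-power
    # bandwidth is sqrt(ln 2) / (pi * sigma_t); solve for sigma_t.
    sigma_t = np.sqrt(np.log(2.0)) / (np.pi * bandwidth)
    t = (np.arange(length) - length // 2) / sample_rate  # centred time axis (s)
    envelope = np.exp(-0.5 * (t / sigma_t) ** 2)
    return envelope * np.exp(2j * np.pi * fc * t)


def half_power_bandwidth(h, sample_rate=16000, n_fft=1 << 16):
    """Width in Hz of the region where |H(f)|^2 is at least half its peak."""
    power = np.abs(np.fft.fft(h, n_fft)) ** 2
    freqs = np.fft.fftfreq(n_fft, d=1.0 / sample_rate)
    passband = freqs[power >= 0.5 * power.max()]
    return passband.max() - passband.min()


# Same centre frequency and Q values as in the Figure 4 caption.
for q in (1.5, 2.0, 2.5):
    h = gabor_filter(3000.0, q)
    print(f"Q={q}: measured bandwidth = {half_power_bandwidth(h):.1f} Hz")
```

Under these assumptions, a larger Q gives a narrower passband and a longer impulse response, which is the trade-off an adaptive front-end changes when it adjusts Q from frame to frame.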