Should Audio Front-ends be Adaptive? Comparing Learnable and Adaptive Front-ends
Qiquan Zhang, Buddhi Wickramasinghe, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Haizhou Li
TL;DR
To address the fixed inference-time behavior of existing audio front-ends, the paper proposes Ada-FE, a Gabor-based front-end with a fixed first layer and an adaptive second layer whose Q-factor is updated frame by frame via two modules: a level-dependent adaptation and a neural adaptive feedback controller. Ada-FE is compared against several lightweight learnable front-ends across two backbones on eight benchmarks spanning speech, sound event, and music tasks. Results show that Ada-FE consistently outperforms state-of-the-art learnable front-ends and is robust to test-time variations, while the simplified Ada-FE-S-FM achieves competitive or better accuracy with fewer hand-crafted components. The findings suggest that adaptive front-ends offer meaningful gains in real-world environments and open avenues for integrating adaptivity with audio representation learning and pre-training.
Abstract
Hand-crafted features, such as Mel-filterbanks, have traditionally been the choice for many audio processing applications. Recently, there has been growing interest in learnable front-ends that extract representations directly from the raw audio waveform. However, both hand-crafted filterbanks and current learnable front-ends lead to fixed computation graphs at inference time and therefore cannot dynamically adapt to varying acoustic environments, a key feature of human auditory systems. To this end, we explore the question of whether audio front-ends should be adaptive by comparing Ada-FE, a recently developed adaptive front-end that employs a neural adaptive feedback controller to dynamically adjust the Q-factors of its spectral decomposition filters, to established learnable front-ends. Specifically, we systematically investigate learnable front-ends and Ada-FE across two commonly used back-end backbones and a wide range of audio benchmarks covering speech, sound event, and music tasks. The comprehensive results show that Ada-FE outperforms advanced learnable front-ends and, more importantly, that it remains notably stable and robust on test samples across training epochs.
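To make the idea of frame-wise Q-factor adaptation concrete, the following is a minimal NumPy sketch, not the Ada-FE implementation. It builds a real Gabor kernel whose bandwidth is the center frequency divided by Q, and re-selects Q for every frame from the frame level. The function names, the Q range, the dB floor, and the linear level-to-Q mapping are all assumptions made for illustration; the neural adaptive feedback controller described above is omitted entirely.

```python
import numpy as np


def gabor_kernel(center_hz, q, sample_rate=16000, length=201):
    """Unit-norm real Gabor kernel (Gaussian-windowed cosine).

    The bandwidth is center_hz / q, so a higher Q gives a narrower band
    and a longer kernel in time. The sigma convention is an assumption.
    """
    t = (np.arange(length) - length // 2) / sample_rate
    bandwidth_hz = center_hz / q
    sigma_t = 1.0 / (2.0 * np.pi * bandwidth_hz)  # Gaussian std in seconds
    window = np.exp(-0.5 * (t / sigma_t) ** 2)
    kernel = window * np.cos(2.0 * np.pi * center_hz * t)
    return kernel / np.linalg.norm(kernel)


def level_dependent_q(frame, q_min=2.0, q_max=8.0, floor_db=-60.0):
    """Hypothetical rule: louder frames get a lower Q (broader filter).

    Maps the frame level in dB (relative to full scale) linearly from
    [floor_db, 0] onto [q_max, q_min]. Not taken from the paper.
    """
    rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
    level_db = 20.0 * np.log10(rms)
    x = np.clip((level_db - floor_db) / (0.0 - floor_db), 0.0, 1.0)
    return q_max - x * (q_max - q_min)


def adaptive_filter_energy(signal, center_hz, frame_len=400, sample_rate=16000):
    """For each non-overlapping frame: choose Q, filter, return (q, energy)."""
    results = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        q = level_dependent_q(frame)
        kernel = gabor_kernel(center_hz, q, sample_rate)
        filtered = np.convolve(frame, kernel, mode="same")
        results.append((q, float(np.mean(filtered ** 2))))
    return results


# Usage: a 1 kHz tone that is loud for 50 ms and then very quiet for 50 ms.
t = np.arange(1600) / 16000
tone = np.sin(2.0 * np.pi * 1000.0 * t)
tone[800:] *= 1e-4
per_frame = adaptive_filter_energy(tone, center_hz=1000.0)
```

In this sketch the loud frames receive a lower Q than the quiet ones, so the filter changes from frame to frame at inference time. A front-end with a fixed computation graph would instead apply the same kernel to every frame.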
