Meta-Learning in Audio and Speech Processing: An End to End Comprehensive Review
Athul Raimon, Shubha Masti, Shyam K Sateesh, Siyani Vengatagiri, Bhaskarjyoti Das
TL;DR
This survey tackles the challenge of limited annotated data in audio by aggregating meta-learning approaches tailored to audio and speech tasks. It analyzes the end-to-end pipeline, from data preprocessing and feature representations to task selection and meta-learner architectures, with a focus on $N$-way $K$-shot settings and cross-domain transfer. Key findings are that Prototypical Networks perform well in low-data regimes, that MAML offers robust generalization across tasks and languages, and that data augmentation plus specialized loss functions substantially boost performance in noisy or polyphonic audio. The review also identifies open challenges, such as domain mismatch, open-set detection, and multi-label scenarios, and provides guidance on datasets and practical considerations to advance audio meta-learning in real-world, low-resource settings.
Abstract
This survey reviews meta-learning approaches used in audio and speech processing. Meta-learning targets settings where a model must perform well with only a few annotated samples, which makes it well suited to low-sample audio processing. Although the field has produced significant contributions, audio meta-learning still lacks a comprehensive survey. We present a systematic review of meta-learning methodologies in audio processing, with audio-specific discussions of data augmentation, feature extraction, preprocessing techniques, meta-learners, and task selection strategies, along with important audio datasets and crucial real-world use cases. Through this review, we aim to provide valuable insights and identify future research directions at the intersection of meta-learning and audio processing.
