Meta-Learning in Audio and Speech Processing: An End to End Comprehensive Review

Athul Raimon; Shubha Masti; Shyam K Sateesh; Siyani Vengatagiri; Bhaskarjyoti Das

TL;DR

This survey tackles the challenge of limited annotated data in audio by aggregating meta-learning approaches tailored to audio and speech tasks. It analyzes the end-to-end pipeline from data preprocessing and feature representations to task selection and meta-learner architectures, with a focus on $N$-way $K$-shot settings and cross-domain transfer. Key findings highlight that Prototypical Networks perform well in low-data regimes, that MAML offers robust generalization across tasks and languages, and that data augmentation plus specialized loss functions substantially boost performance in noisy or polyphonic audio. The review also identifies open challenges, such as domain mismatch, open-set detection, and multi-label scenarios, and provides guidance on datasets and practical considerations to advance audio meta-learning in real-world, low-resource settings.

Abstract

This survey gives an overview of meta-learning approaches used in audio and speech processing. Meta-learning is used where model performance needs to be maximized with a minimal number of annotated samples, making it suitable for low-sample audio processing. Although the field has made some significant contributions, audio meta-learning still lacks comprehensive survey papers. We present a systematic review of meta-learning methodologies in audio processing. The review covers audio-specific data augmentation, feature extraction, preprocessing techniques, meta-learners, and task selection strategies, and presents important audio datasets together with crucial real-world use cases. Through this extensive review, we aim to provide valuable insights and identify future research directions at the intersection of meta-learning and audio processing.

Paper Structure

This paper contains 27 sections, 1 figure, and 5 tables.

Introduction
Background
Audio Specific Meta-Learning Approaches
Data Preprocessing
Sampling Rates.
Features.
Signal to Noise Ratio (SNR).
Data Augmentation Techniques.
Traditional FSL Methods
Prototypical Networks.
Dynamic Few-Shot Continual Learning (DFSL).
Model Agnostic Meta-Learning (MAML).
Enhancements to Traditional FSL Methods
Changing Loss Function.
Encoders.
...and 12 more sections

Figures (1)

Figure 1: Overview of Few-Shot Learning Techniques
