Prototypical Contrastive Learning For Improved Few-Shot Audio Classification

Christos Sgouropoulos, Christos Nikou, Stefanos Vlachos, Vasileios Theiou, Christos Foukanelis, Theodoros Giannakopoulos

TL;DR

This work tackles few-shot audio classification by integrating supervised angular contrastive loss into prototypical few-shot training, enhanced with SpecAugment and self-attention to produce robust unified embeddings. The authors design four modules (augmentation, embedding, few-shot, and contrastive) and compare CPL and Angular Prototype Loss within two training regimes, demonstrating state-of-the-art performance on MetaAudio in a 5-way 5-shot setup. Key findings show that angular loss consistently improves representations over standard contrastive losses and plain ProtoNets, often matching or surpassing optimization-based methods like MAML while requiring significantly less computation. The approach advances practical few-shot audio classification and promotes reproducibility with detailed methodology and dataset handling.

Abstract

Few-shot learning has emerged as a powerful paradigm for training models with limited labeled data, addressing challenges in scenarios where large-scale annotation is impractical. While extensive research has been conducted in the image domain, few-shot learning in audio classification remains relatively underexplored. In this work, we investigate the effect of integrating supervised contrastive loss into prototypical few-shot training for audio classification. In particular, we demonstrate that angular loss further improves performance compared to the standard contrastive loss. Our method leverages SpecAugment followed by a self-attention mechanism to encapsulate the diverse information of augmented input versions into one unified embedding. We evaluate our approach on MetaAudio, a benchmark including five datasets with predefined splits, standardized preprocessing, and a comprehensive set of few-shot learning models for comparison. The proposed approach achieves state-of-the-art performance in a 5-way, 5-shot setting.
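
For readers unfamiliar with prototypical few-shot classification, the following is a minimal sketch of the generic nearest-prototype step that this line of work builds on. It is not the authors' implementation: the function names, the toy 2-way 2-shot episode, and the use of squared Euclidean distance on hand-written 2-D vectors are all illustrative assumptions, and the SpecAugment, self-attention, and contrastive components described above are omitted.

```python
from collections import defaultdict


def class_prototypes(support, labels):
    """Return {label: mean embedding} for a labelled support set."""
    grouped = defaultdict(list)
    for vec, lab in zip(support, labels):
        grouped[lab].append(vec)
    # Average each embedding dimension over the shots of a class.
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, protos):
    """Assign a query embedding to the class of its nearest prototype."""
    return min(protos, key=lambda lab: squared_distance(query, protos[lab]))


# Hypothetical 2-way 2-shot episode; real embeddings would come from an encoder.
support = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = ["dog_bark", "dog_bark", "siren", "siren"]

protos = class_prototypes(support, labels)
prediction = classify([1.0, 1.0], protos)
```

In a 5-way, 5-shot episode the same logic applies with five classes and five support embeddings per class; only the size of the support set changes.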

Paper Structure

This paper contains 9 sections, 9 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Comparison of ProtoNets, FS+CPL and FS+APL across different numbers of shots.
  • Figure 2: Module importance in overall performance per dataset.