Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation

Riyansha Singh; Parinita Nema; Vinod K Kurmi

TL;DR

This paper tackles robust audio FSCIL by introducing supervised contrastive learning (SCL) during the base session to create tightly clustered, well-separated base-class representations, facilitating the integration of new classes. It then employs a stochastic classifier with dynamically updated prototypes to expand the classifier across incremental sessions, using a prototype loss to align new prototypes with current embeddings. The approach demonstrates state-of-the-art performance on NSynth-100, LibriSpeech LS-100, ESC-50, and ESC-10, with notable gains in average accuracy and reduced performance drop. The work advances practical few-shot incremental learning for dynamic audio vocabularies, with implications for real-time, adaptive audio analytics in smart devices and related applications.
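The idea of expanding a classifier with prototypes across sessions can be illustrated with a minimal sketch. It is a simplified, deterministic nearest-prototype classifier, not the paper's stochastic classifier or its prototype loss, neither of which is specified here. The class name `PrototypeClassifier`, the use of class-mean embeddings as prototypes, and cosine-similarity scoring are all assumptions made for illustration.

```python
import numpy as np


class PrototypeClassifier:
    """Hypothetical nearest-prototype classifier that grows across sessions."""

    def __init__(self):
        # Maps class id -> unit-norm prototype vector.
        self.prototypes = {}

    def add_classes(self, embeddings, labels):
        """Add (or overwrite) one prototype per class in this session.

        Assumption: a prototype is the mean embedding of the class's
        few-shot samples, normalised to unit length.
        """
        embeddings = np.asarray(embeddings, dtype=np.float64)
        labels = np.asarray(labels)
        for c in np.unique(labels):
            mean = embeddings[labels == c].mean(axis=0)
            self.prototypes[int(c)] = mean / np.linalg.norm(mean)

    def predict(self, embeddings):
        """Return the class whose prototype has the highest cosine similarity."""
        x = np.asarray(embeddings, dtype=np.float64)
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        class_ids = sorted(self.prototypes)
        protos = np.stack([self.prototypes[c] for c in class_ids])
        scores = x @ protos.T  # cosine similarity, since both sides are unit-norm
        return [class_ids[i] for i in scores.argmax(axis=1)]


# Base session: two classes with toy 3-D embeddings.
clf = PrototypeClassifier()
clf.add_classes([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0]], [0, 0, 1, 1])

# Incremental session: one new class arrives with only two shots.
clf.add_classes([[0, 0, 1], [0, 0.1, 0.9]], [2, 2])
predictions = clf.predict([[0.05, 0, 1], [0, 1, 0.1]])
```

Because the embedding extractor is left untouched in this sketch, adding a class only appends a prototype. This is why well-separated base-class representations matter: new prototypes need free space in the embedding to avoid colliding with old ones.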

Abstract

In machine learning applications, data often arrives gradually, especially in audio processing, where incremental learning is vital for real-time analytics. Few-shot class-incremental learning addresses the challenges that arise when incoming data is limited. To mitigate catastrophic forgetting and overfitting, existing methods often integrate additional trainable components or rely on a fixed embedding extractor after training on the base session. However, using cross-entropy loss alone during base session training is suboptimal for audio data. To address this, we propose incorporating supervised contrastive learning to refine the representation space, enhancing its discriminative power and improving generalization, since it facilitates the seamless integration of incremental classes as they arrive. Experimental results on the NSynth and LibriSpeech datasets with 100 classes, as well as the ESC dataset with 50 and 10 classes, demonstrate state-of-the-art performance.
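To make the supervised contrastive objective concrete, here is a minimal NumPy sketch of the commonly used supervised contrastive loss, in which each anchor is pulled towards same-class samples and pushed away from all others. The paper's exact variant, temperature, and batching are not given in this text, so the function name `supervised_contrastive_loss` and the default `temperature=0.1` are assumptions.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Sketch of a supervised contrastive loss over one batch.

    Assumptions: embeddings are L2-normalised here, similarity is the dot
    product scaled by `temperature`, and anchors with no same-class
    partner in the batch are skipped.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)

    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Pairwise similarities; an anchor is never compared with itself.
    sim = np.where(self_mask, -np.inf, z @ z.T / temperature)

    # Numerically stable log-softmax over all other samples in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))

    # Positives: same label as the anchor, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0

    # Average log-probability of the positives, negated, per valid anchor.
    sum_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float((-sum_pos[valid] / pos_count[valid]).mean())


# Toy batch: two tight clusters in 2-D.
points = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]]
loss_clustered = supervised_contrastive_loss(points, [0, 0, 1, 1])  # labels match clusters
loss_mixed = supervised_contrastive_loss(points, [0, 1, 0, 1])      # labels cut across clusters
```

When the labels agree with the geometric clusters, the loss is small; when they cut across clusters, it is large. This is the pressure that produces tightly clustered, well-separated class representations during base-session training.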

Paper Structure (11 sections, 8 equations, 4 figures, 2 tables)

This paper contains 11 sections, 8 equations, 4 figures, 2 tables.

Introduction
Methodology
Problem Statement
Proposed Model
Enhancing feature representation space using contrastive learning
Base session training
Incremental session training
Results and Experiments
Experimental results
Analysis
Conclusion

Figures (4)

Figure 1: During base training, we follow a two-stage process. First, the model is trained with the contrastive loss function to separate classes with minimum overlap. Subsequently, the model is trained with the cross-entropy loss function to guide the optimization process towards better classification performance.
Figure 2: Comparison of per-session accuracy of the baseline and our SCL on ESC-50 and ESC-10, trained for 10 and 3 sessions respectively, with '0' denoting the base session.
Figure 3: Base Class Embedding Space representation for LS-100
Figure 4: (a) shows that a positive correlation exists between the number of shots and the resulting accuracy scores (except 5-way 5-shot) holding the number of ways constant
