Episodic fine-tuning prototypical networks for optimization-based few-shot learning: Application to audio classification
Xuanyu Zhuang, Geoffroy Peeters, Gaël Richard
TL;DR
The paper tackles few-shot audio classification by enhancing Prototypical Networks through Rotational Division Fine-Tuning (RDFT), which uses labeled support data to fine-tune the ProtoNet in a test episode without leveraging the query set. It further embeds ProtoNet into optimization-based meta-learners, yielding MAML-Proto and MC-Proto via an episodic fine-tuning strategy that applies RDFT within inner updates. Empirical results on ESC-50 and Speech Commands v2 show that RDFT alone can degrade ProtoNet, but when integrated with MAML/Meta-Curvature, the proposed models achieve substantial gains over a regular ProtoNet, with MC-Proto delivering the strongest accuracy among the tested configurations (though Proto-HA remains SOTA on ESC-50). The approach is presented as a general framework with potential applicability beyond audio to other modalities, and future work includes extending RDFT to additional metric-based FSL methods and providing theoretical insights.
Abstract
The Prototypical Network (ProtoNet) has emerged as a popular choice in Few-shot Learning (FSL) scenarios due to its remarkable performance and straightforward implementation. Building upon this success, we first propose a simple (yet novel) method to fine-tune a ProtoNet on the (labeled) support set of a C-way-K-shot test episode (without using the query set, which is only used for evaluation). We then propose an algorithmic framework that combines ProtoNet with optimization-based FSL algorithms (MAML and Meta-Curvature) to work with this fine-tuning method. Since optimization-based algorithms endow the target learner model with the ability to adapt quickly from only a few samples, we use ProtoNet as the target model and enhance its fine-tuning performance with a specifically designed episodic fine-tuning strategy. The experimental results confirm that our proposed models, MAML-Proto and MC-Proto, combined with our unique fine-tuning method, outperform regular ProtoNet by a large margin in few-shot audio classification tasks on the ESC-50 and Speech Commands v2 datasets. Although we have only applied our model to the audio domain, it is a general method and can be easily extended to other domains.
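For readers unfamiliar with the baseline, the following is a minimal sketch of standard ProtoNet inference on a single C-way-K-shot episode: each class prototype is the mean of its support embeddings, and a query is assigned to the class whose prototype is nearest in squared Euclidean distance. It does not implement the fine-tuning method or the MAML-Proto and MC-Proto models proposed here. The function names, the class labels, and the hand-written 2-D vectors standing in for the output of a trained embedding network are all assumptions made for illustration.

```python
import math


def prototypes(support):
    """Return one prototype per class: the mean of its support embeddings.

    `support` maps a class label to a list of K embedding vectors.
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return protos


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_probabilities(query, protos):
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {label: -squared_distance(query, p) for label, p in protos.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def predict(query, protos):
    """Return the label of the most probable class for one query embedding."""
    probs = class_probabilities(query, protos)
    return max(probs, key=probs.get)


# Toy 2-way-2-shot episode with placeholder embeddings (assumed values).
toy_support = {
    "dog_bark": [[0.0, 0.0], [0.0, 2.0]],
    "rain": [[4.0, 4.0], [6.0, 4.0]],
}
toy_protos = prototypes(toy_support)
toy_prediction = predict([1.0, 1.0], toy_protos)
```

In this toy episode the query lies much closer to the `dog_bark` prototype than to the `rain` prototype, so `toy_prediction` is `"dog_bark"`. In the setting described above, the query set is used only for this kind of evaluation, while any fine-tuning relies on the labeled support set alone.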
