MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations
Calum Heggan, Tim Hospedales, Sam Budgett, Mehrdad Yaghoobi
TL;DR
MT-SLVR addresses the problem that a single augmentation-invariance bias may be suboptimal across diverse downstream tasks. It introduces a multi-task self-supervised framework that jointly learns augmentation-invariant and augmentation-sensitive features. The framework combines a contrastive loss with a Multi-Label Augmentation Prediction (MLAP) objective, using a parameter-efficient architecture with a shared backbone and task-specific adapters to optimize $L_{\text{Total}} = L_{\text{Cont}} + \lambda L_{\text{MLAP}}$. Evaluated on ten audio and speech few-shot datasets with a frozen ResNet-18 backbone and linear evaluators, MT-SLVR consistently outperforms baselines, and an invariance analysis shows that the different heads capture distinct invariances, enabling flexible downstream adaptation. Together, the invariance analysis and adapter-based design indicate that a mixed representation supports data-efficient transfer for voice-related tasks and beyond, which is of practical benefit in low-label regimes. The approach thus offers a scalable path to robust, adaptable audio recognition across diverse applications.
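To make the combined objective concrete, the following is a minimal sketch of how L_Total = L_Cont + \lambda L_MLAP could be computed. It is not the authors' implementation. It assumes an NT-Xent-style contrastive term for L_Cont and a binary cross-entropy over per-augmentation labels for L_MLAP; the text above only names "a contrastive loss" and "Multi-Label Augmentation Prediction", so both forms are assumptions. The function names (`nt_xent`, `mlap_loss`, `total_loss`) and the default `temperature` and `lam` values are hypothetical and chosen for illustration.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent-style contrastive loss (assumed form of L_Cont).

    z1[i] and z2[i] are embeddings of two augmented views of sample i.
    """
    views = z1 + z2          # 2N embeddings in one list
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        logits = [_cosine(views[i], views[j]) / temperature
                  for j in range(2 * n) if j != i]
        pos_logit = _cosine(views[i], views[pos]) / temperature
        # Numerically stable log-sum-exp over all non-self pairs.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


def mlap_loss(logits, labels):
    """Binary cross-entropy over augmentation labels (assumed form of L_MLAP).

    logits[i][k] is the raw score that augmentation k was applied to sample i;
    labels[i][k] is 1.0 if it was applied and 0.0 otherwise.
    """
    total, count = 0.0, 0
    for row_logits, row_labels in zip(logits, labels):
        for x, y in zip(row_logits, row_labels):
            # Stable BCE-with-logits: max(x, 0) - x*y + log(1 + exp(-|x|))
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


def total_loss(l_cont, l_mlap, lam=1.0):
    """Weighted sum L_Total = L_Cont + lam * L_MLAP."""
    return l_cont + lam * l_mlap
```

In this sketch the contrastive term rewards embeddings that agree across augmented views (invariance), while the prediction term rewards features from which the applied augmentations can be recovered (sensitivity); `lam` sets the balance between the two.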
Abstract
Contrastive self-supervised learning has gained attention for its ability to create high-quality representations from large unlabelled data sets. A key reason that these powerful features enable data-efficient learning of downstream tasks is that they provide augmentation invariance, which is often a useful inductive bias. However, the amount and type of invariance preferred are not known a priori and vary across downstream tasks. We therefore propose a multi-task self-supervised framework (MT-SLVR) that learns both variant and invariant features in a parameter-efficient manner. Our multi-task representation provides a strong and flexible feature that benefits diverse downstream tasks. We evaluate our approach on few-shot classification tasks drawn from a variety of audio domains and demonstrate improved classification performance on all of them.
