uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures

Afrina Tabassum; Dung Tran; Trung Dang; Ismini Lourentzou; Kazuhito Koishida

Abstract

Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Utilizing contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both input and virtual label spaces. Experiments in low/few-shot settings demonstrate that uaMix-MAE achieves 4-6% accuracy improvements over various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at https://github.com/PLAN-Lab/uamix-MAE
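To make the idea of mixing in both the input space and the virtual label space more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes spectrograms of shape (batch, time, freq), a CutMix-style swap of a contiguous block of time frames between clips, and instance-level "virtual" labels given by the identity matrix over the batch. The function names `time_cutmix` and `soft_contrastive_loss`, the Beta-distributed mixing ratio, and the temperature value are all illustrative assumptions; the exact mixing scheme and loss are defined in the Methodology section and the released code.

```python
import numpy as np


def time_cutmix(specs, rng, alpha=1.0):
    """Swap one block of time frames between clips (hypothetical helper).

    specs: array of shape (batch, time, freq).
    Returns the mixed clips, soft virtual labels, and the partner permutation.
    """
    batch, n_frames, _ = specs.shape
    lam = rng.beta(alpha, alpha)                      # assumed mixing-ratio prior
    cut_len = int(round((1.0 - lam) * n_frames))      # frames taken from the partner
    start = rng.integers(0, n_frames - cut_len + 1)   # block start (high is exclusive)
    perm = rng.permutation(batch)                     # partner clip for each sample

    mixed = specs.copy()
    mixed[:, start:start + cut_len, :] = specs[perm, start:start + cut_len, :]

    # Recompute the ratio from the integer block length so labels match the input.
    lam_eff = 1.0 - cut_len / n_frames
    virtual = np.eye(batch)                           # one virtual class per instance
    soft_labels = lam_eff * virtual + (1.0 - lam_eff) * virtual[perm]
    return mixed, soft_labels, perm


def soft_contrastive_loss(z_mixed, z_clean, soft_labels, temperature=0.1):
    """Cross-entropy between similarity logits and soft virtual labels."""
    z_mixed = z_mixed / np.linalg.norm(z_mixed, axis=1, keepdims=True)
    z_clean = z_clean / np.linalg.norm(z_clean, axis=1, keepdims=True)
    logits = z_mixed @ z_clean.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(soft_labels * log_probs).sum(axis=1).mean())
```

In this sketch, a mixed clip that keeps 70% of its own frames is pulled toward its own clean embedding with weight 0.7 and toward its partner's with weight 0.3, which is one way to read the "softened" targets described in the abstract.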

Paper Structure (9 sections, 3 equations, 6 figures, 4 tables)

Introduction
Related Work
Methodology
Experiments
Few-shot Learning
Few-shot Ablation Studies
Fine-tuning
Qualitative Analysis
Conclusion

Figures (6)

Figure 1: uaMix-MAE overview. Left: T-CutMix contrastive tuning. Right: Progressive retraining of $f_\theta$ and $h_\theta$. DA: Data Augmentation.
Figure 2: (Top row) When original audio samples $e_i$, $e_j$ are passed through $f_\theta$ and $h_\theta$, the positive pair is close to each other, and the negative pair lies far in the feature space, resulting in a sharp decision boundary in the virtual label space. (Bottom row) uaMix-MAE creates mixed audio samples $m_{ij}$ and $m_{ji}$ in the input space and uses a softened distance function in the virtual label space, resulting in a smoother decision boundary.
Figure 3: N-way k-shot performance comparison on VoxCeleb1.
Figure 4: Few-shot performance comparison on VoxCeleb1 among uaMix-MAE variants: No Mixing and MixUp + LS.
Figure 5: Few-shot performance comparison between uaMix-MAE and uaMix-MAE-TF-CutMix on VoxCeleb1.
...and 1 more figure
