uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures

Afrina Tabassum, Dung Tran, Trung Dang, Ismini Lourentzou, Kazuhito Koishida

Abstract

Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Utilizing contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both input and virtual label spaces. Experiments in low/few-shot settings demonstrate that uaMix-MAE achieves 4-6% accuracy improvements over various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at https://github.com/PLAN-Lab/uamix-MAE

Paper Structure (9 sections, 3 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: uaMix-MAE overview. Left: T-CutMix contrastive tuning. Right: Progressive retraining of $f_\theta$ and $h_\theta$. DA: Data Augmentation.
  • Figure 2: (Top row) When the original audio samples $e_i$ and $e_j$ are passed through $f_\theta$ and $h_\theta$, the positive pair lies close together and the negative pair lies far apart in the feature space, resulting in a sharp decision boundary in the virtual label space. (Bottom row) uaMix-MAE creates mixed audio samples $m_{ij}$ and $m_{ji}$ in the input space and uses a softened distance function in the virtual label space, resulting in a smoother decision boundary.
  • Figure 3: N-way k-shot performance comparison on VoxCeleb1.
  • Figure 4: Few-shot performance comparison on VoxCeleb1 among uaMix-MAE variants: No Mixing and MixUp + LS.
  • Figure 5: Few-shot performance comparison between uaMix-MAE and uaMix-MAE-TF-CutMix on VoxCeleb1.
  • ...and 1 more figure
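
The Figure 2 caption describes mixing in two places: mixed samples $m_{ij}$ in the input space, and a softened target in the virtual label space. The sketch below illustrates that idea only and is not the authors' implementation. It assumes that "T-CutMix" means a CutMix-style swap of a contiguous block of time frames between two spectrograms, and that the softened target is a two-hot weighting over instances used in a soft cross-entropy contrastive loss. The function names (`time_cutmix`, `soft_targets`, `soft_contrastive_loss`), the temperature value, and the use of NumPy are all hypothetical choices made for illustration.

```python
import numpy as np


def time_cutmix(spec_i, spec_j, lam, rng):
    """Replace a contiguous block of time frames in spec_i with the same frames of spec_j.

    spec_i, spec_j: arrays of shape (time, freq).
    lam: target fraction of frames kept from spec_i.
    Returns the mixed spectrogram and the fraction actually kept.
    """
    n_frames = spec_i.shape[0]
    cut_len = int(round((1.0 - lam) * n_frames))
    # Sample the start so that the block fits inside the clip.
    start = int(rng.integers(0, n_frames - cut_len + 1))
    mixed = spec_i.copy()
    mixed[start:start + cut_len] = spec_j[start:start + cut_len]
    lam_eff = 1.0 - cut_len / n_frames
    return mixed, lam_eff


def soft_targets(batch_size, perm, lam):
    """Virtual labels: row i puts weight lam on instance i and (1 - lam) on instance perm[i]."""
    targets = lam * np.eye(batch_size)
    targets[np.arange(batch_size), perm] += 1.0 - lam
    return targets


def soft_contrastive_loss(z_mixed, z_clean, targets, temperature=0.2):
    """Cross-entropy between soft instance targets and a softmax over cosine similarities."""
    z_mixed = z_mixed / np.linalg.norm(z_mixed, axis=1, keepdims=True)
    z_clean = z_clean / np.linalg.norm(z_clean, axis=1, keepdims=True)
    logits = z_mixed @ z_clean.T / temperature
    # Subtract the row maximum for numerical stability before the log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())
```

With a hard one-hot target (`lam = 1`) this reduces to a standard InfoNCE-style instance discrimination loss. Values of `lam` below 1 spread the target mass over both source instances, which is one plausible reading of the "smoother decision boundary" mentioned in the caption.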