Myna: Masking-Based Contrastive Learning of Musical Representations

Ori Yonay, Tracy Hammond, Tianbao Yang

TL;DR

The paper tackles efficient self-supervised musical representation learning by replacing domain-specific augmentations with a masking-based contrastive framework. Myna applies a Vision Transformer to mel-spectrograms and masks 90% of tokens to generate views, which enables large per-GPU batch sizes and single-GPU training. The authors also introduce Myna-Hybrid, which combines square and vertical patch configurations and achieves state-of-the-art performance among models trained on publicly available data, with particularly strong results in key detection. Across MagnaTagATune, GTZAN, GiantSteps, and EmoMusic, Myna generalizes well, whereas the MAE baseline is more limited to tasks that depend on local detail. The work presents masking-based contrastive learning as a scalable, domain-agnostic approach to learning musically meaningful representations and motivates further scaling and cross-modal applications.

Abstract

We present Myna, a simple yet effective approach for self-supervised musical representation learning. Built on a contrastive learning framework, Myna introduces two key innovations: (1) the use of a Vision Transformer (ViT) on mel-spectrograms as the backbone and (2) a novel data augmentation strategy, token masking, that masks 90 percent of spectrogram tokens. These innovations deliver both effectiveness and efficiency: (i) Token masking enables a significant increase in per-GPU batch size, from 48 or 120 in prior methods (CLMR, MULE) to 4096. (ii) By avoiding traditional augmentations, Myna retains pitch sensitivity, enhancing performance in tasks like key detection. (iii) The use of vertical patches allows the model to better capture critical features for key detection. Our hybrid model, Myna-22M-Hybrid, processes both 16x16 and 128x2 patches, achieving state-of-the-art results. Trained on a single GPU, it outperforms MULE (62M) on average and rivals MERT-95M, which were trained on 16 and 64 GPUs, respectively. Additionally, it surpasses MERT-95M-public, establishing itself as the best-performing model trained on publicly available data. We release our code and models to promote reproducibility and facilitate future research.
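
To make the token-masking idea in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It splits a spectrogram into 16x16 or 128x2 patches and keeps a random 10 percent of them as one view. The 128x96 spectrogram shape, the rounding rule for the number of kept tokens, and the function names `patchify` and `random_token_mask` are assumptions made for illustration.

```python
import numpy as np


def patchify(spec, patch_h, patch_w):
    """Split a (H, W) spectrogram into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_h * patch_w), ordered
    row by row over the patch grid.
    """
    h, w = spec.shape
    assert h % patch_h == 0 and w % patch_w == 0, "patch size must divide input"
    grid = spec.reshape(h // patch_h, patch_h, w // patch_w, patch_w)
    grid = grid.transpose(0, 2, 1, 3)  # (rows, cols, patch_h, patch_w)
    return grid.reshape(-1, patch_h * patch_w)


def random_token_mask(patches, mask_ratio, rng):
    """Keep a random subset of patches; the rest are dropped (masked)."""
    n = patches.shape[0]
    n_keep = max(1, round(n * (1.0 - mask_ratio)))  # assumed rounding rule
    keep_idx = np.sort(rng.choice(n, size=n_keep, replace=False))
    return patches[keep_idx], keep_idx


rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 96)).astype(np.float32)  # assumed shape

square = patchify(spec, 16, 16)    # square patches
vertical = patchify(spec, 128, 2)  # vertical patches spanning all mel bins

# Two independently masked views of the same spectrogram form a positive pair.
view_a, idx_a = random_token_mask(square, 0.9, rng)
view_b, idx_b = random_token_mask(square, 0.9, rng)
print(square.shape, vertical.shape, view_a.shape)
```

Under these assumptions each view carries roughly a tenth of the tokens, so an encoder that only processes the kept tokens sees much shorter sequences, which is consistent with the larger per-GPU batch sizes the abstract reports.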

Paper Structure

This paper contains 31 sections, 2 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Myna is efficient: we achieve competitive downstream task performance while requiring significantly fewer computational resources than other models. Models trained on public datasets are shown in blue, and models trained on private datasets are shown in green. Myna is trained on a publicly available dataset and is marked in red.
  • Figure 2: The Myna pre-training framework. Tokens from spectrogram patches are randomly masked before being processed by a transformer encoder. The resulting embeddings are contrasted to maximize similarity between masked views of the same data while minimizing similarity with all other samples (negatives). Tokenizers, encoders and projector modules refer to the same sets of shared weights. For downstream tasks, the projector is discarded and replaced with a task-specific head (labeled "Probe" above) to leverage the learned embeddings.
  • Figure 3: Hybrid model training. A three-second spectrogram is sampled and made into patches. After masking, the patches are processed by their respective tokenizer, consisting of a linear projection and positional embedding. The resulting tokens are fed to a shared encoder/projector module. To compute the hybrid loss, two forward passes are performed with vertical and square patches. The hybrid loss is the average of the vertical and square losses.
  • Figure 4: Performance as a function of masking ratio on MagnaTagATune (MTT), on GiantSteps, and averaged across all four benchmarks (MTT, GiantSteps, EmoMusic, and GTZAN).
  • Figure 5: t-SNE visualizations of different embeddings (top to bottom: Myna-Hybrid, MAE, and CLMR) for the GTZAN dataset. Each subplot shows the distribution of samples in the training, validation, and test subsets, color-coded by class label. The GTZAN dataset was not used in training any of these models.
  • ...and 4 more figures
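
The captions of Figures 2 and 3 describe contrasting masked views of the same sample against all other samples and averaging the vertical and square losses. The sketch below illustrates that structure with a symmetric InfoNCE-style loss in NumPy. The specific loss form, the temperature value of 0.1, and the function names `info_nce` and `hybrid_loss` are assumptions for illustration; the captions do not specify them.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE-style loss between two (B, D) embedding batches.

    Row i of z_a and row i of z_b are treated as a positive pair; every
    other row in the batch serves as a negative.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (B, B), positives on the diagonal

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


def hybrid_loss(square_views, vertical_views, temperature=0.1):
    """Average of the square-patch loss and the vertical-patch loss."""
    loss_square = info_nce(*square_views, temperature=temperature)
    loss_vertical = info_nce(*vertical_views, temperature=temperature)
    return 0.5 * (loss_square + loss_vertical)


# Stand-in embeddings; a real setup would take these from the shared
# encoder/projector applied to masked square and vertical tokens.
rng = np.random.default_rng(0)
sq_a, sq_b = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
vt_a, vt_b = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
loss = hybrid_loss((sq_a, sq_b), (vt_a, vt_b))
print(float(loss))
```

In this sketch the two patch configurations contribute equally to the objective, matching the caption's statement that the hybrid loss is the average of the vertical and square losses.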