Music auto-tagging in the long tail: A few-shot approach

T. Aleksandra Ma, Alexander Lerch

TL;DR

This work tackles the challenge of scalable, long-tail multi-label music auto-tagging by introducing a transfer-learning-based few-shot framework. It uses frozen pre-trained audio embeddings (VGGish, OpenL3, PaSST) as inputs to a lightweight linear probe, enabling new tags to be learned from only a few labeled examples. Experiments on MagnaTagATune top-50 tags show that combining multiple embeddings yields near state-of-the-art performance with as few as 20 samples per tag, and even full-data performance rivals leading models. The findings demonstrate strong data-efficiency and transferability, suggesting practical benefits for customizable tag taxonomies and long-tail tag expansion in music catalogs.

Abstract

In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfactory accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable way to expand beyond this small set of predefined tags: a model learns the meaning of a tag from only a few human-provided examples and then applies that tag autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.
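The idea of a linear probe trained on an $N$-way, $K$-shot subset can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the embeddings are random stand-ins for frozen pre-trained features, the labels are synthetic, and `sample_few_shot`, `train_linear_probe`, and `predict_proba` are hypothetical helpers. The probe is one sigmoid output per tag, fitted here with plain gradient descent on binary cross-entropy; the paper's actual sampling and optimization details are not stated in this summary.

```python
import numpy as np


def sample_few_shot(labels, n_way, k_shot, rng):
    """Pick n_way tags and up to k_shot positive clips per tag (hypothetical sampler)."""
    tags = rng.choice(labels.shape[1], size=n_way, replace=False)
    chosen = []
    for t in tags:
        pos = np.flatnonzero(labels[:, t])
        chosen.extend(rng.choice(pos, size=min(k_shot, pos.size), replace=False))
    # A clip can be positive for several tags, so duplicates are merged:
    # the subset holds at most n_way * k_shot clips.
    return np.unique(chosen), tags


def predict_proba(x, w, b):
    """Per-tag probabilities from the sigmoid of a linear map."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def train_linear_probe(x, y, epochs=200, lr=0.1):
    """Fit one sigmoid output per tag on fixed features (binary cross-entropy)."""
    w = np.zeros((x.shape[1], y.shape[1]))
    b = np.zeros(y.shape[1])
    for _ in range(epochs):
        grad = (predict_proba(x, w, b) - y) / x.shape[0]
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


rng = np.random.default_rng(0)
emb = rng.normal(size=(400, 16))               # stand-in for frozen embeddings
true_w = rng.normal(size=(16, 10))
labels = (emb @ true_w > 1.0).astype(float)    # synthetic multi-label targets

idx, tags = sample_few_shot(labels, n_way=5, k_shot=20, rng=rng)
w, b = train_linear_probe(emb[idx], labels[idx][:, tags])
probs = predict_proba(emb, w, b)               # one probability per clip and tag
```

Because the embedding model is never updated, only the small weight matrix `w` and bias `b` are learned, which is what makes adding a new tag from a handful of examples cheap.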

Paper Structure (19 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Experimental setup: few-shot linear probes are trained on $N\times K$ data points, and full linear probes are trained on all the data in the training set. Test performance metrics are calculated on the full test set using the probabilities output by the sigmoid activation function.
  • Figure 2: Distribution of Top 50 tags of MagnaTagATune by split.
  • Figure 3: Share of each pre-trained embedding's weight magnitude sum in the overall weight magnitude sum of the combined feature, in percent.
  • Figure 4: Performance comparison of 50-way classifiers as a function of the number of training samples per class. Top: mAP. Bottom: correlation coefficient between 20-shot probe weights and full probe weights.
  • Figure 5: Heatmap of how mean average precision (mAP) and area under the receiver operating characteristic curve (AUC-ROC) change as the number of classes (horiz.) and the number of training samples per class (vert.) increase. Only the best-performing model is shown.
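The two weight-based quantities named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the combined feature is assumed to be a concatenation of the individual embeddings, the block widths are toy values rather than the real embedding sizes, the correlation is assumed to be Pearson's coefficient over the flattened weights, and `weight_share` and `weight_correlation` are hypothetical helpers.

```python
import numpy as np


def weight_share(w, dims):
    """Fraction of sum(|w|) that falls in each embedding block.

    w    : (total_dim, n_tags) probe weights on the concatenated feature
    dims : dict mapping embedding name -> width, in concatenation order
    """
    total = np.abs(w).sum()
    shares, start = {}, 0
    for name, width in dims.items():
        shares[name] = float(np.abs(w[start:start + width]).sum() / total)
        start += width
    return shares


def weight_correlation(w_a, w_b):
    """Pearson correlation between two flattened weight matrices (assumed metric)."""
    return float(np.corrcoef(w_a.ravel(), w_b.ravel())[0, 1])


rng = np.random.default_rng(0)
dims = {"VGGish": 4, "OpenL3": 6, "PaSST": 8}        # toy widths, not the real sizes
w_full = rng.normal(size=(sum(dims.values()), 3))    # stand-in for full-data probe weights
w_few = w_full + 0.5 * rng.normal(size=w_full.shape) # stand-in for few-shot probe weights

shares = weight_share(w_full, dims)    # fractions that sum to 1
corr = weight_correlation(w_few, w_full)
```

Multiplying each share by 100 gives the percentage form used in the Figure 3 caption.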