Music Era Recognition Using Supervised Contrastive Learning and Artist Information

Qiqi He; Xuchen Song; Weituo Hao; Ju-Chiang Wang; Wei-Tsung Lu; Wei Li

TL;DR

This work tackles music era recognition by casting it as a year-range classification problem and proposes a progression of models: a CNN baseline, a supervised contrastive learning model (Audio-SUC), and a multimodal framework (AudioArt-MMC) that fuses audio with artist biography text through a Transformer-based fusion module. The models are trained with the loss terms $L_{\mathrm{MLE}}$, $L_{\mathrm{EC}}$, and $L_{\mathrm{MMC}}$, with text-shuffle augmentations enabling robust multimodal contrastive learning. On the Million Song Dataset and an internal In-House collection, the audio-only approach achieves $54\%$ accuracy within a ±3-year tolerance, while incorporating artist information via MMC yields a further $+9\%$ improvement, demonstrating resilience to data imbalance and effective discrimination between nearby year ranges. The approach has practical implications for era-aware playlisting and for metadata augmentation when release years are missing or noisy.

Abstract

Does popular music from the 60s sound different from that of the 90s? Prior work has shown variations in patterns and regularities, related to instrumentation changes and growing loudness, across multi-decadal trends. This indicates that it is possible to perceive the era of a song from musical features such as audio and artist information. Music era information can be an important feature for playlist generation and recommendation. However, the release year of a song can be inaccessible in many circumstances. This paper addresses the novel task of music era recognition. We formulate the task as a music classification problem and propose solutions based on supervised contrastive learning. An audio-based model is developed to predict the era from audio. For the case where artist information is available, we extend the audio-based model to take multimodal inputs and develop a framework, called MultiModal Contrastive (MMC) learning, to enhance the training. Experimental results on the Million Song Dataset demonstrate that the audio-based model achieves 54% accuracy within a tolerance of 3 years; incorporating the artist information with the MMC framework for training leads to a further 9% improvement.
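The tolerance-based accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a prediction counts as correct when the predicted year lies within the tolerance of the true release year, with the boundary included; the abstract does not spell out the exact definition, and the function name `accuracy_within_tolerance` and the toy years are hypothetical.

```python
def accuracy_within_tolerance(true_years, predicted_years, tolerance=3):
    """Fraction of songs whose predicted year is within `tolerance` years
    of the true year (assumed definition, boundary inclusive)."""
    if len(true_years) != len(predicted_years):
        raise ValueError("true_years and predicted_years must have equal length")
    if len(true_years) == 0:
        raise ValueError("at least one song is required")
    # Count predictions whose absolute year error does not exceed the tolerance.
    hits = sum(
        1
        for true, pred in zip(true_years, predicted_years)
        if abs(true - pred) <= tolerance
    )
    return hits / len(true_years)


# Toy example (made-up years): absolute errors are 2, 4, 0 and 10 years.
toy_true = [1965, 1990, 2005, 2010]
toy_pred = [1967, 1994, 2005, 2000]
toy_accuracy = accuracy_within_tolerance(toy_true, toy_pred, tolerance=3)
```

In the toy example, two of the four predictions fall within three years of the true year, so `toy_accuracy` is 0.5. If the model predicts year ranges rather than single years, the same idea would apply to a representative year of each range, which is again an assumption.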

Paper Structure (12 sections, 6 equations, 4 figures, 2 tables)

Introduction
Proposed Method
Convolutional Neural Network (CNN)
Supervised Contrastive (SUC) Learning
Multi-Modal Contrastive (MMC) Learning
Multi-Modal Fusion Module
Multi-Modal Contrastive (MMC) Loss
Experiments
Experiment Setup
Results and Discussion
Conclusion
Acknowledgement

Figures (4)

Figure 1: Illustration of the effects of contrastive learning. From (a) to (b), the EC loss helps aggregate songs by era class. From (c) to (d), the MMC loss improves the clustering of songs by artist.
Figure 2: Era distribution of our In-House dataset. The upper sub-figure covers the years from 1960 to 2020. The lower sub-figure enlarges the pre-1990 years.
Figure 3: Overall illustration of AudioArt-MMC. (a) The Audio-Artist Encoder, which mainly contains an audio encoder, an artist biography text encoder, and a multi-modal fusion module. (b) The encoded latent representation ${\boldsymbol{z}}$ goes through the corresponding projection heads for the MMC loss and the EC loss, respectively.
Figure 4: t-SNE visualization of the latent representations of different methods. The active years of each artist are marked in parentheses.
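
The clustering effect that Figure 1 attributes to contrastive training can be illustrated with a generic supervised contrastive loss. This is a minimal NumPy sketch of the general technique, not the paper's exact $L_{\mathrm{EC}}$ or $L_{\mathrm{MMC}}$, whose definitions are not reproduced on this page. The function name, the temperature value, and the toy embeddings are assumptions chosen for illustration.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative sketch).

    Samples sharing a label (for example, the same era class) are treated
    as positives for each other; all other samples act as negatives.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)
    n = len(y)

    # Positives: same label, excluding the anchor itself.
    self_mask = np.eye(n, dtype=bool)
    pos_mask = (y[:, None] == y[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    if not valid.any():
        return 0.0  # no anchor has a positive in this batch

    # Cosine similarities scaled by the temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    sim = np.where(self_mask, -np.inf, sim)  # exclude self-comparisons

    # Row-wise log-softmax over all other samples (numerically stabilised).
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_counts[valid]
    return float(per_anchor.mean())


# Toy embeddings (made up): two tight groups in a 2-D latent space.
toy_z = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss_matching = supervised_contrastive_loss(toy_z, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_z, [0, 1, 0, 1])
```

When the labels agree with the geometric grouping, the loss is lower than when they cut across it, so minimising it pulls same-class songs together in the latent space. That is the qualitative behaviour the figure describes.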
