Music Era Recognition Using Supervised Contrastive Learning and Artist Information
Qiqi He, Xuchen Song, Weituo Hao, Ju-Chiang Wang, Wei-Tsung Lu, Wei Li
TL;DR
This work casts music era recognition as a year-range classification problem and proposes a progression of models: a CNN baseline, an audio model trained with supervised contrastive learning (Audio-SUC), and a multimodal framework (AudioArt-MMC) that fuses audio with artist biography through a Transformer-based fusion module. The models are trained with the loss terms $L_{\mathrm{MLE}}$, $L_{\mathrm{EC}}$, and $L_{\mathrm{MMC}}$, and text-shuffle augmentation supports the multimodal contrastive learning. On the Million Song Dataset and an internal collection (In-House), the audio-only approach achieves $54\%$ accuracy within a ±3-year tolerance, and incorporating artist information via MMC yields a further $9\%$ improvement. The results also indicate resilience to data imbalance and effective discrimination between nearby year ranges. The approach is relevant to era-aware playlisting and to metadata augmentation when release years are missing or noisy.
Abstract
Does popular music from the 60s sound different from that of the 90s? Prior studies have shown that patterns and regularities related to instrumentation changes and growing loudness vary across multi-decadal trends. This indicates that it is possible to perceive the era of a song from musical features such as audio and artist information. Music era information can be an important feature for playlist generation and recommendation. However, the release year of a song is inaccessible in many circumstances. This paper addresses the novel task of music era recognition. We formulate the task as a music classification problem and propose solutions based on supervised contrastive learning. An audio-based model is developed to predict the era from audio. For the case where artist information is available, we extend the audio-based model to take multimodal inputs and develop a framework, called MultiModal Contrastive (MMC) learning, to enhance the training. Experimental results on the Million Song Dataset demonstrate that the audio-based model achieves 54% accuracy within a 3-year tolerance; incorporating the artist information with the MMC framework for training leads to a further 9% improvement.
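The reported figure, accuracy within a 3-year tolerance, can be illustrated with a minimal sketch. It assumes that each prediction is reduced to a single representative year and that a prediction counts as correct when it lies within `tolerance` years of the true release year; the abstract does not spell out the exact definition, so this is one plausible reading and not necessarily the authors' evaluation code. The function name `accuracy_within_tolerance` and the sample years are hypothetical.

```python
from typing import Sequence


def accuracy_within_tolerance(
    true_years: Sequence[int],
    predicted_years: Sequence[int],
    tolerance: int = 3,
) -> float:
    """Fraction of predictions within `tolerance` years of the true year.

    Assumption (not stated in the abstract): a prediction is a single year,
    and it is a hit when |true - predicted| <= tolerance.
    """
    if len(true_years) != len(predicted_years):
        raise ValueError("true_years and predicted_years must have equal length")
    if len(true_years) == 0:
        raise ValueError("at least one example is required")

    # Count predictions whose absolute year error is inside the tolerance.
    hits = sum(
        1
        for true, predicted in zip(true_years, predicted_years)
        if abs(true - predicted) <= tolerance
    )
    return hits / len(true_years)


if __name__ == "__main__":
    # Made-up years for illustration only; not data from the paper.
    true_years = [1965, 1992, 2001, 1978]
    predicted_years = [1967, 1999, 2001, 1974]
    print(accuracy_within_tolerance(true_years, predicted_years, tolerance=3))
```

Under this reading, setting `tolerance=0` reduces the metric to exact-year accuracy, and larger tolerances forgive confusions between nearby years.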
