Music Auto-Tagging with Robust Music Representation Learned via Domain Adversarial Training

Haesun Joung; Kyogu Lee

TL;DR

This work tackles robust music auto-tagging under real-world noise by adopting Domain Adversarial Training (DAT) to learn domain-invariant representations across clean and noisy audio. It introduces an additional pretraining step for the domain classifier and uses synthesized unlabeled noisy data to enhance cross-domain generalization, supported by a CLMR/SampleCNN-based Feature Extractor (FE), a Domain Classifier (DC), and a lightweight Label Predictor (LP). Training progresses in three stages (FE pretraining, DC pretraining, and joint FE finetuning with LP training), guided by a total loss that combines tag supervision on the source domain with domain confusion on both domains: $\mathcal{L}_\text{Total} = \mathcal{L}_\text{LP}^{\text{src}} + \lambda( \mathcal{L}_\text{DC}^{\text{src}} + \mathcal{L}_\text{DC}^{\text{trg}} )$. Empirical results demonstrate that increasing noise variety improves robustness, with the most unlabeled-data-efficient configuration (proposal (b)) delivering the strongest gains, and Musan tests confirming stable performance across noise types. The approach promises broad applicability for music discovery and recommendation in noisy multimedia contexts.
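To make the structure of the total loss concrete, the following is a minimal sketch in plain Python. It only shows how the three terms are combined. The use of binary cross-entropy for both the tag loss and the domain loss, the domain labels (0 for source, 1 for target), and the function names `bce` and `total_loss` are assumptions made for illustration and are not taken from the paper. The adversarial mechanism by which the FE is pushed to confuse the DC is not shown here.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(tag_probs_src, tag_labels_src, dc_probs_src, dc_probs_trg, lam=1.0):
    """L_Total = L_LP^src + lam * (L_DC^src + L_DC^trg).

    Assumptions (hypothetical, for illustration only):
    - tag and domain losses are both binary cross-entropy;
    - the DC outputs the probability that a sample is from the target domain,
      so source samples are labelled 0 and target samples 1.
    """
    l_lp_src = bce(tag_probs_src, tag_labels_src)           # tag supervision, source only
    l_dc_src = bce(dc_probs_src, [0.0] * len(dc_probs_src))  # domain loss on source
    l_dc_trg = bce(dc_probs_trg, [1.0] * len(dc_probs_trg))  # domain loss on target
    return l_lp_src + lam * (l_dc_src + l_dc_trg)


if __name__ == "__main__":
    # Toy inputs: two tag predictions on a source clip, one DC output per domain.
    print(total_loss([0.9, 0.2], [1.0, 0.0], [0.3], [0.6], lam=0.5))
```

Note that only the source domain contributes a tag loss, which is why the target-domain data can remain unlabeled; setting `lam` to 0 reduces the objective to plain supervised tagging on source data.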

Abstract

Music auto-tagging is crucial for enhancing music discovery and recommendation. Existing models in Music Information Retrieval (MIR) struggle with real-world noise such as environmental and speech sounds in multimedia content. This study proposes a method inspired by speech-related tasks to enhance music auto-tagging performance in noisy settings. The approach integrates Domain Adversarial Training (DAT) into the music domain, enabling robust music representations that withstand noise. Unlike previous research, this approach involves an additional pretraining phase for the domain classifier to avoid performance degradation in the subsequent phase. Adding various synthesized noisy music data improves the model's generalization across different noise levels. The proposed architecture demonstrates enhanced performance in music auto-tagging by effectively utilizing unlabeled noisy music data. Additional experiments with supplementary unlabeled data further improve the model's performance, underscoring its robust generalization capabilities and broad applicability.

Paper Structure (10 sections, 3 equations, 3 figures, 2 tables)

Introduction
RELATED WORKS
METHOD
Architecture
Training Process
DATASET
Data Configuration
EXPERIMENT
CONCLUSION
Acknowledgement

Figures (3)

Figure 1: Feature extraction from clean and noisy music tracks in robust music representation learning. The extractor aims to produce closely positioned embeddings for the same track, regardless of audio quality.
Figure 2: The proposed architecture and training process. Overall structure is composed of Feature Extractor (FE, pink), Domain Classifier (DC, yellow), and Label Predictor (LP, green). The training process is set to 3 steps for 1) pretraining FE, 2) pretraining DC, and 3) finetuning FE and training LP. In contrast, for both the baseline and oracle configurations, only the FE and LP are utilized, leading to a simplified two-step training process.
Figure 3: Proposed dataset configuration. The music dataset (red) utilizes MTAT law2009evaluation and the music split from Musan snyder2015musan. The real-world noise dataset (yellow) incorporates Audioset gemmeke2017audio and the noise split from Musan. Synthesized samples combining music and noise are designated as target domain data. During training, source (src) and target (trg) domain samples do not overlap. Note that the Musan noise dataset is used exclusively for creating the test set and is not used in forming the validation set.
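
The synthesis of target-domain samples described in Figure 3 can be illustrated with a short sketch. The captions state only that music and noise are combined; mixing at a chosen signal-to-noise ratio (SNR), the function name `mix_at_snr`, and the `snr_db` parameter are assumptions for illustration, not the paper's documented procedure.

```python
import math


def mix_at_snr(music, noise, snr_db):
    """Add noise to a music signal so the mixture has the requested SNR in dB.

    Hypothetical helper: both inputs are equal-length sequences of samples.
    The noise is rescaled so that music_power / scaled_noise_power
    equals 10 ** (snr_db / 10).
    """
    if len(music) != len(noise):
        raise ValueError("music and noise must have the same length")
    p_music = sum(x * x for x in music) / len(music)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0.0:
        return list(music)  # silent noise: nothing to add
    scale = math.sqrt(p_music / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [m + scale * n for m, n in zip(music, noise)]


if __name__ == "__main__":
    music = [1.0, -1.0, 1.0, -1.0]
    noise = [0.5, 0.5, -0.5, -0.5]
    # Lower SNR values give a noisier target-domain sample.
    for snr in (20.0, 10.0, 0.0):
        print(snr, mix_at_snr(music, noise, snr))
```

Sweeping `snr_db` over several values is one simple way to obtain the variety of noise levels that the abstract credits with improving generalization.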
