Enhancing Automatic Chord Recognition via Pseudo-Labeling and Knowledge Distillation

Nghia Phan, Rong Jin, Gang Liu, Xiao Dong

TL;DR

This paper tackles Automatic Chord Recognition under data scarcity by introducing a two-stage training pipeline that first learns from unlabeled audio via pseudo-labels generated by a pre-trained teacher, then continually learns from ground-truth labels with selective knowledge distillation to prevent forgetting. The approach works across architectures, demonstrated with a deep BTC teacher and a lighter 2E1D transformer-based student, yielding substantial gains over supervised baselines and the teacher, especially for rare chord qualities. Key contributions include decoupling pseudo-label pretraining from labeled-data fine-tuning, a formal KD regularization framework, selective KD to manage confidence, and evidence that large unlabeled datasets can compensate for scarce labeled data in ACR. The results suggest practical impact for leveraging open-weight models and unlabeled corpora to achieve robust chord recognition in real-world, diverse musical content.

Abstract

Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher applied as a regularizer to prevent catastrophic forgetting of the representations learned in the first stage. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher's performance, while the 2E1D student achieves about 96%, across seven standard mir_eval metrics. After a single stage 2 training run for each student, the resulting BTC student surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher by 1.55% on average across all metrics. The resulting 2E1D student improves on the traditional supervised learning baseline by 3.79% on average and achieves almost the same performance as the teacher. In both cases, the gains are large on rare chord qualities.
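
The two stages described in the abstract can be illustrated with a minimal per-frame sketch in plain Python. It is not the authors' implementation: the function names, the additive form of the loss (cross-entropy plus `alpha` times a KL term), the temperature, and the confidence-threshold rule used to make the distillation "selective" are all assumptions made for illustration. Only the general idea comes from the text: hard pseudo-labels from a teacher in stage 1, and a teacher-based KD regularizer weighted by `alpha` in stage 2.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits):
    """Stage 1 (sketch): take the teacher's most likely chord class as the label."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])


def cross_entropy(probs, label):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(max(probs[label], 1e-12))


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student), skipping zero-probability teacher entries."""
    return sum(
        p * math.log(p / max(q, 1e-12))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def selective_kd_loss(student_logits, teacher_logits, label,
                      alpha=0.5, confidence_threshold=0.7, temperature=1.0):
    """Stage 2 (sketch): supervised loss plus a KD regularizer on selected frames.

    ASSUMPTION: a frame is "selected" when the teacher's top probability
    reaches `confidence_threshold`. The paper's actual selection rule and
    loss weighting are not given in this summary.
    """
    ce = cross_entropy(softmax(student_logits), label)
    teacher_probs = softmax(teacher_logits, temperature)
    if max(teacher_probs) < confidence_threshold:
        return ce  # teacher is unsure: fall back to the supervised loss only
    student_soft = softmax(student_logits, temperature)
    return ce + alpha * kl_divergence(teacher_probs, student_soft)
```

Under these assumptions, a frame where the teacher is confident pulls the student toward the teacher's distribution, while a frame where the teacher is unsure contributes only the ground-truth cross-entropy. In a real training loop the same computation would run over batches of frames in a tensor library.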

Paper Structure (19 sections, 13 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Duration-weighted chord root distribution across pseudo-labeled datasets. The dashed line indicates uniform distribution (8.33%). Pitch classes are well-represented with 98.4% uniformity.
  • Figure 2: Constant-Q Transform (CQT) comparison revealing pitch-shifting artifacts. Top row: original and pitch-shifted ($-5$ semitones) spectrograms. Bottom row: artifact intensity maps for $\pm5$ semitones, computed by realigning shifted CQT bins to compensate for the intended frequency shift.
  • Figure 3: Illustration of the proposed two-stage training pipeline. Details of the audio data are given in \ref{sec:data}. Note: The resulting Student Model CL of stage 2 can be continually trained when additional labeled data is available.
  • Figure 4: Experimental Dual Encoder Architecture (2E1D): The model consists of separate temporal and frequency encoders that process CQT features independently before fusion for chord classification.
  • Figure 5: Evaluation loss of BTC model during continual training with different KD weights. Higher $\alpha$ values provide stronger regularization, mitigating performance degradation from noisy labels.
  • ...and 1 more figures
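
The duration-weighted root distribution in Figure 1 can be sketched as follows. This is an illustrative example only: the segment format `(start_seconds, end_seconds, root)`, the function names, and the use of normalized entropy as a uniformity score are assumptions, since this summary does not say how the 98.4% uniformity figure was computed. The 8.33% reference line is simply 1/12, the share each of the 12 pitch classes would have under a uniform distribution.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def root_distribution(segments):
    """Share of total labeled time per chord root.

    `segments` is an iterable of (start_seconds, end_seconds, root) tuples,
    a hypothetical format chosen for this sketch.
    """
    totals = {pc: 0.0 for pc in PITCH_CLASSES}
    for start, end, root in segments:
        totals[root] += end - start  # weight each label by its duration
    total = sum(totals.values())
    if total <= 0.0:
        raise ValueError("segments must cover a positive total duration")
    return {pc: duration / total for pc, duration in totals.items()}


def normalized_entropy(distribution):
    """Entropy divided by its maximum: 1.0 is perfectly uniform, 0.0 is one class.

    ASSUMPTION: used here as one plausible uniformity score, not necessarily
    the measure behind the figure's reported value.
    """
    entropy = -sum(p * math.log(p) for p in distribution.values() if p > 0.0)
    return entropy / math.log(len(distribution))


# Example: C lasts 2 s, G and F last 1 s each.
example = root_distribution([(0.0, 2.0, "C"), (2.0, 3.0, "G"), (3.0, 4.0, "F")])
uniform_share = 1.0 / len(PITCH_CLASSES)  # about 0.0833, the dashed line
```

Weighting by duration rather than by label count keeps long sustained chords from being under-represented relative to short passing chords.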