
STONE: Self-supervised Tonality Estimator

Yuexuan Kong, Vincent Lostanlen, Gabriel Meseguer-Brocal, Stella Wong, Mathieu Lagrange, Romain Hennequin

TL;DR

STONE is a self-supervised framework for tonality estimation that learns key-signature representations without labeled data by enforcing invariance and equivariance through CPSD-based losses on paired audio segments. Its architecture, ChromaNet, enforces octave equivalence to produce a 12-dimensional key signature profile (KSP), which is extended to a structured 24-output space to jointly predict key signature and mode. On FMAK, the semi-supervised variant Semi-TONE matches the accuracy of the fully supervised Sup-TONE with reduced supervision and outperforms it with equal supervision. A key limitation is that the CPSD objective alone cannot distinguish major from minor keys sharing the same key signature, so some supervision is still needed for mode; future work could extend the approach to other pitch-relative MIR tasks.

Abstract

Although deep neural networks can estimate the key of a musical piece, their supervision incurs a massive annotation effort. Against this shortcoming, we present STONE, the first self-supervised tonality estimator. The architecture behind STONE, named ChromaNet, is a convnet with octave equivalence which outputs a key signature profile (KSP) of 12 structured logits. First, we train ChromaNet to regress artificial pitch transpositions between any two unlabeled musical excerpts from the same audio track, as measured by the cross-power spectral density (CPSD) within the circle of fifths (CoF). We observe that this self-supervised pretext task leads the KSP to correlate with tonal key signature. Based on this observation, we extend STONE to output a structured KSP of 24 logits, and introduce supervision so as to disambiguate major versus minor keys sharing the same key signature. Applying different amounts of supervision yields semi-supervised and fully supervised tonality estimators, i.e., Semi-TONEs and Sup-TONEs. We evaluate these estimators on FMAK, a new dataset of 5489 real-world musical recordings with expert annotation of 24 major and minor keys. We find that Semi-TONE matches the classification accuracy of Sup-TONE with reduced supervision and outperforms it with equal supervision.
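
To make the CPSD idea concrete, here is a minimal NumPy sketch of why a cross-power spectral density can encode a pitch transposition. It assumes that $\omega=7$ indexes a DFT frequency over the 12 chromas (a perfect fifth spans 7 semitones), and it relies only on the DFT shift theorem. The function names `dft_coefficient`, `cpsd`, and `recover_shift` are hypothetical and are not taken from the paper; the actual STONE losses are regression residuals in the complex domain and are not reproduced here.

```python
import numpy as np

N_CHROMA = 12
OMEGA = 7  # assumption: DFT frequency tied to the circle of fifths


def dft_coefficient(ksp, omega=OMEGA):
    """Single DFT coefficient of a 12-dimensional profile at frequency omega."""
    q = np.arange(N_CHROMA)
    return np.sum(ksp * np.exp(-2j * np.pi * omega * q / N_CHROMA))


def cpsd(ksp_a, ksp_b, omega=OMEGA):
    """Cross-power spectral density of two profiles at frequency omega."""
    return dft_coefficient(ksp_a, omega) * np.conj(dft_coefficient(ksp_b, omega))


def recover_shift(ksp_a, ksp_b, omega=OMEGA):
    """Estimate k such that ksp_b is ksp_a circularly shifted by k chromas.

    By the shift theorem, cpsd(ksp_b, ksp_a) has phase -2*pi*omega*k/12.
    Since 7 is coprime with 12 and 7*7 = 49 = 1 (mod 12), multiplying by 7
    inverts the map k -> 7k (mod 12). This inverse is specific to omega=7.
    """
    phase = np.angle(cpsd(ksp_b, ksp_a, omega))
    m = int(np.rint(-phase * N_CHROMA / (2 * np.pi))) % N_CHROMA  # m = 7k mod 12
    return (7 * m) % N_CHROMA


# Toy profile standing in for a learned KSP (random, normalized to sum to 1).
rng = np.random.default_rng(0)
ksp_a = rng.random(N_CHROMA)
ksp_a /= ksp_a.sum()

# Simulate a transposition by 3 semitones as a circular shift of the profile.
ksp_b = np.roll(ksp_a, 3)
estimated_k = recover_shift(ksp_a, ksp_b)
```

Because 7 and 12 are coprime, the phase of this single coefficient identifies the shift uniquely among the 12 possible transpositions, which is one way to read the choice of $\omega=7$.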

Paper Structure (31 sections, 17 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Overview of the equivariant pretext task in STONE. Given two segments A and B from an unlabeled musical recording, we compute their constant-$Q$ transforms (CQT) and apply random crops by $c$ and $(c+k)$ to simulate pitch transpositions. We feed them to ChromaNet, an equivariant neural network with octave equivalence, yielding a learned key signature profile (KSP) of 12 chromas. We compute the discrete Fourier transform (DFT) of each KSP and derive pairwise cross-power spectral densities (CPSD). Self-supervised losses $\mathcal{L}_{\mathrm{AA}}$, $\mathcal{L}_{\mathrm{AB}}$, and $\mathcal{L}_{\mathrm{BA}}$ are formulated as CPSD regression residuals in the complex domain.
  • Figure 2: We modify the ChromaNet architecture of Figure 1 to accommodate structured prediction of key signature and mode. We apply batch normalization per mode $m$ and softmax over all coefficients, yielding a $12\times2$ matrix $\mathbf{Y}_{\boldsymbol{\theta}}(\boldsymbol{x})$. Summing $\mathbf{Y}_{\boldsymbol{\theta}}(\boldsymbol{x})$ over modes $m$ yields a learned key signature profile $\lambda_{\boldsymbol{\theta}}(\boldsymbol{x})$ in dimension 12; summing $\mathbf{Y}_{\boldsymbol{\theta}}(\boldsymbol{x})$ over chromas $q$ yields a pitch-invariant 2-dimensional vector $\mu_{\boldsymbol{\theta}}(\boldsymbol{x})$.
  • Figure 3: Evaluation of self-supervised (dashed blue), semi-supervised (solid blue), and supervised models (orange) on FMAK. All models use $\omega=7$. We also report the supervised state of the art (SOTA) of Korzeniowski and Widmer (2018) in dashed green.
  • Figure 4: Confusion matrices of STONE (left, 12 classes) and Semi-TONE (right, 24 classes) on FMAK, both using $\omega=7$. The axes correspond to model prediction and reference, respectively, with keys arranged by proximity in the CoF and by relative mode. Deeper colors indicate a higher relative frequency of occurrence per reference key.
  • Figure :
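
The marginalization described in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' implementation: the per-mode batch normalization is omitted, the input logits are random placeholders, and the names `structured_output` and `marginals` are hypothetical.

```python
import numpy as np


def structured_output(logits):
    """Softmax over all 24 coefficients of a (12, 2) array of logits.

    Rows index chromas q, columns index modes m. Batch normalization
    per mode, mentioned in the caption, is left out of this sketch.
    """
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def marginals(y):
    """Return the 12-dim key signature profile and the 2-dim mode vector."""
    ksp = y.sum(axis=1)   # sum over modes m   -> shape (12,)
    mode = y.sum(axis=0)  # sum over chromas q -> shape (2,)
    return ksp, mode


# Placeholder logits standing in for the network output.
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 2))

y = structured_output(logits)
ksp, mode = marginals(y)

# Transposing by 5 semitones, modeled here as a circular shift along chromas.
y_shifted = structured_output(np.roll(logits, 5, axis=0))
ksp_shifted, mode_shifted = marginals(y_shifted)
```

In this toy setting, shifting the logits along the chroma axis shifts `ksp` by the same amount and leaves `mode` unchanged, which matches the caption's description of $\mu_{\boldsymbol{\theta}}(\boldsymbol{x})$ as pitch-invariant.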