Controlling Contrastive Self-Supervised Learning with Knowledge-Driven Multiple Hypothesis: Application to Beat Tracking

Antonin Gagnere, Slim Essid, Geoffroy Peeters

TL;DR

The paper tackles ambiguity in beat and downbeat labeling by introducing Knowledge-Driven Multi-Hypothesis Learning (KD-MHL), which guides contrastive self-supervised pre-training with multiple domain-informed hypotheses, defined as $K = \sum_{\omega \in \Omega} \omega$. It defines an encoder with multiple projection heads, a scoring function for hypothesis compatibility, and a selector that retains the top hypotheses to form the SSL loss. The approach is instantiated for musical rhythm analysis with PLP-based hypotheses, and a new self-training variant is explored. On beat and downbeat tracking benchmarks, KD-MHL achieves state-of-the-art results after pre-training and fine-tuning, often surpassing prior methods by notable margins, and the self-training variant attains additional gains on most datasets. The work demonstrates that embedding domain knowledge into SSL and leveraging multiple plausible interpretations can substantially improve MIR representations and downstream rhythm tasks.

Abstract

Ambiguities in data and problem constraints can lead to diverse, equally plausible outcomes for a machine learning task. In beat and downbeat tracking, for instance, different listeners may adopt various rhythmic interpretations, none of which would necessarily be incorrect. To address this, we propose a contrastive self-supervised pre-training approach that leverages multiple hypotheses about possible positive samples in the data. Our model is trained to learn representations compatible with different such hypotheses, which are selected with a knowledge-based scoring function to retain the most plausible ones. When fine-tuned on labeled data, our model outperforms existing methods on standard benchmarks, showcasing the advantages of integrating domain knowledge with multi-hypothesis selection in music representation learning in particular.

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of our SSL framework based on KD-MHL. The input sample $\mathbf{x}$ is encoded by $g_{\theta}$ into a representation $\mathbf{z}$, which is further projected by $K$ heads $f_{\theta}^k$ corresponding to multiple hypotheses $\mathcal{H}_k$ from a pool $\mathcal{H}$. Hypotheses are driven by domain knowledge and lead to specific strategies for sampling anchors, positives, and negatives within a contrastive framework. A function $h_k$ scores each hypothesis. These scores are used by a mechanism $s$ that selects the $n$ winning hypotheses. At each step, the encoder is trained considering only the winning hypotheses (i.e., only the loss contributions from the winning heads).
  • Figure 2: Instantiation of our SSL framework based on KD-MHL for musical rhythm analysis (beat and downbeat tracking). The input is a sequence $\mathbf{x}_t$ that represents the audio signal over time $t$, which is projected by $g_{\theta}$ into a sequence $\mathbf{z}_t$. The objective is to train $g_{\theta}$ such that $\mathbf{z}$ takes different values depending on whether $t$ is a beat or not. This is achieved using contrastive learning, sampling triplets (anchor, positive, and negative times). Driven by knowledge, we create a pool of hypotheses $\mathcal{H}_k \in \mathcal{H}$, which correspond to possible metrical relationships between PLP peaks (which define the time units $t$) and beats, and therefore to specific triplet samplings $\mathcal{T}_k$. Each $\mathcal{H}_k$ is scored by $h_k$ considering the evolution of the audio features under the given metrical relationship. These scores are used by a mechanism $s$ that selects the $n$ winning hypotheses. At each step, the encoder $g_{\theta}$ is trained considering only the $n$ winning hypotheses (i.e., only the loss contributions from the $n$ winning heads).
  • Figure 3: PLP function (blue) alongside beat annotations (dashed lines) and detected peaks (green crosses). The top plot displays a (4,2) hypothesis; the bottom one displays a phase-shifting issue.
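
The selection step described in the Figure 1 and Figure 2 captions (score each hypothesis, keep the $n$ winners, train only on the winning heads' losses) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `select_winners`, `triplet_loss`, and `kd_mhl_loss` are hypothetical, the margin-based triplet loss is only a stand-in for whatever contrastive loss the paper uses, and averaging the winning losses (rather than summing them) is an assumption. The scores are taken as given, since the captions do not spell out how $h_k$ is computed.

```python
import math
from typing import List, Sequence


def select_winners(scores: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n highest-scoring hypotheses.

    Plays the role of the selector s in the captions. Ties are broken in
    favour of the lower index (an arbitrary choice made for this sketch).
    """
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of hypotheses")
    ranked = sorted(range(len(scores)), key=lambda k: (-scores[k], k))
    return sorted(ranked[:n])


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Margin-based triplet loss on Euclidean distances.

    Assumption: used here only as a generic example of a contrastive
    objective over (anchor, positive, negative) triplets.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def kd_mhl_loss(per_head_losses: Sequence[float],
                scores: Sequence[float], n: int) -> float:
    """Combine the losses of the n winning heads only.

    per_head_losses[k] is the contrastive loss computed from head k under
    hypothesis H_k; scores[k] is the knowledge-based score h_k. Losing
    heads contribute nothing to the result.
    """
    if len(per_head_losses) != len(scores):
        raise ValueError("one loss and one score are needed per hypothesis")
    winners = select_winners(scores, n)
    return sum(per_head_losses[k] for k in winners) / n


if __name__ == "__main__":
    # Four hypothetical hypotheses with made-up scores and per-head losses.
    scores = [0.2, 0.9, 0.5, 0.7]
    losses = [1.0, 0.5, 2.0, 1.5]
    print(select_winners(scores, n=2))       # indices of the two winners
    print(kd_mhl_loss(losses, scores, n=2))  # mean loss over the winners
```

In an actual training loop, the per-head losses would come from the projection heads $f_{\theta}^k$ applied to the encoder output, and masking out the losing heads is what restricts the gradient reaching $g_{\theta}$ to the winning hypotheses.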