A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning

Antonin Gagnere, Geoffroy Peeters, Slim Essid

TL;DR

The paper proposes a novel Self-Supervised-Learning scheme to train rhythm analysis systems, instantiates it for few-shot beat tracking, and shows that a model pre-trained with this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can be successfully fine-tuned in the few-shot regime.

Abstract

In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without knowledge of ground-truth tempo or beat positions: we rely on the local maxima of a Predominant Local Pulse function, considered as a proxy for Tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (remaining time positions). We show that a model pre-trained using this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can successfully be fine-tuned in the few-shot regime, i.e., with just a few annotated examples, to reach competitive beat-tracking performance.
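To make the idea of "contrasting" an anchor against positives and negatives concrete, here is a minimal sketch of an InfoNCE-style loss, the loss used in Contrastive Predictive Coding. The abstract does not state which loss the authors use, so the loss form, the cosine similarity, the `temperature` value and the function name `info_nce` are all assumptions made for illustration.

```python
import numpy as np

def info_nce(z_anchor, z_pos, z_neg, temperature=0.1):
    """InfoNCE-style loss for one anchor (illustrative assumption, not the paper's loss).

    z_anchor: (d,) anchor embedding
    z_pos:    (P, d) embeddings at candidate positive times
    z_neg:    (N, d) embeddings at negative times
    """
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = l2_normalize(np.asarray(z_anchor, dtype=float))
    p = l2_normalize(np.asarray(z_pos, dtype=float))
    n = l2_normalize(np.asarray(z_neg, dtype=float))

    pos_logits = p @ a / temperature  # (P,) cosine similarity to each positive
    neg_logits = n @ a / temperature  # (N,) cosine similarity to each negative

    # Each positive competes against all negatives: column 0 holds the positive.
    logits = np.concatenate(
        [pos_logits[:, None], np.broadcast_to(neg_logits, (len(p), len(n)))],
        axis=1,
    )
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())
```

The loss is low when the anchor embedding is closer to the positives than to the negatives, which is the behavior the pre-training objective described above is meant to encourage.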

Paper Structure

This paper contains 24 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Our proposed contrastive SSL scheme for beat tracking. The left part shows how the audio waveform is processed to obtain the representations $z_{t}$. The right part shows the mining of positives and negatives.
  • Figure 2: Proposed mining strategy for Positives and Negatives (easy and hard) given an Anchor time in the PLP function. Positives are sampled among peaks of the PLP whose time index is distant from the Anchor by a power of two tatum units $tu$ (here $\alpha=4 \times tu$); Negatives are the remaining times and are considered Easy if they are not peaks of the PLP and Hard if they are. Here we sample two hard and two easy negatives.
  • Figure 3: Results of Experiment 1: Few-Shot Learning. Shaded areas represent the standard deviation (ZeroNS in green, our method in blue).
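The mining strategy described in the Figure 2 caption can be sketched as below. This is a rough illustration under stated assumptions, not the authors' implementation: PLP peaks are assumed to be given as sorted frame indices, one tatum unit is assumed to equal one step between consecutive peaks, and the function name `mine_samples`, the `offsets` default and the seeded sampling are hypothetical.

```python
import numpy as np

def mine_samples(peaks, n_frames, anchor_pos, offsets=(1, 2, 4),
                 n_hard=2, n_easy=2, seed=0):
    """Pick positives, hard negatives and easy negatives for one anchor.

    peaks:      sorted frame indices of PLP local maxima (tatum proxies)
    n_frames:   total number of frames in the excerpt
    anchor_pos: rank of the anchor within `peaks`
    offsets:    powers of two, in tatum units (assumed: one unit = one peak step)
    """
    rng = np.random.default_rng(seed)
    peaks = np.asarray(peaks, dtype=int)
    anchor = int(peaks[anchor_pos])

    # Positives: peaks whose rank differs from the anchor's by a power of two.
    pos_ranks = {anchor_pos + sign * k for k in offsets for sign in (-1, 1)}
    pos_ranks = sorted(r for r in pos_ranks if 0 <= r < len(peaks))
    positives = peaks[pos_ranks]

    # Hard negatives: the remaining PLP peaks.
    excluded = set(pos_ranks) | {anchor_pos}
    hard_pool = np.array(
        [p for r, p in enumerate(peaks) if r not in excluded], dtype=int
    )

    # Easy negatives: frames that are not PLP peaks.
    easy_pool = np.setdiff1d(np.arange(n_frames), peaks)

    hard = rng.choice(hard_pool, size=min(n_hard, len(hard_pool)), replace=False)
    easy = rng.choice(easy_pool, size=min(n_easy, len(easy_pool)), replace=False)
    return anchor, positives, hard, easy
```

For example, with `peaks = np.arange(0, 160, 10)`, `n_frames = 160` and `anchor_pos = 8`, the anchor is frame 80 and the candidate positives are the peaks 1, 2 and 4 tatum units away on either side; two hard and two easy negatives are then drawn, matching the counts shown in the figure.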