Cue Point Estimation using Object Detection

Giulia Argüello, Luca A. Lanzendörfer, Roger Wattenhofer

TL;DR

This work reframes cue-point estimation as an object-detection problem by applying a DETR-based transformer to Mel-spectrograms, enabling precise localization of cue points without heavy low-level audio analysis. It introduces the EDM-CUE dataset, comprising 4,710 EDM tracks and 21k manually annotated cue points, substantially expanding the available data for this task. The proposed CUE-DETR model achieves higher precision and better phrase-alignment than prior methods (Automix, MIK), and a supplementary phrasing-based evaluation provides an objective measure of structural alignment. All code, model checkpoints, and the EDM-CUE dataset are openly released to support reproducibility and further research in DJ-related MIR tasks.

Abstract

Cue points indicate possible temporal boundaries in a transition between two pieces of music in DJ mixing and constitute a crucial element in autonomous DJ systems as well as for live mixing. In this work, we present a novel method for automatic cue point estimation, interpreted as a computer vision object detection task. Our proposed system is based on a pre-trained object detection transformer which we fine-tune on our novel cue point dataset. Our provided dataset contains 21k manually annotated cue points from human experts as well as metronome information for nearly 5k individual tracks, making this dataset 35x larger than the previously available cue point dataset. Unlike previous methods, our approach does not require low-level musical information analysis, while demonstrating increased precision in retrieving cue point positions. Moreover, our proposed method demonstrates high adherence to phrasing, a type of high-level music structure commonly emphasized in electronic dance music. The code, model checkpoints, and dataset are made publicly available.

Paper Structure (15 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Top: Distribution of cue point positions in EDM-CUE. Bottom: Distribution of distances between two subsequent cue points in EDM-CUE. The inter-cue distances indicate that 16 bars is the most represented phrasing length in our dataset.
  • Figure 2: Calculation of phrase boundaries $b_i$ using cue points $c_i$. Phrase boundaries, highlighted in blue, serve as additional points to evaluate prediction accuracy. Example a) represents a track with regular phrasing whereas b) shows a track with an irregular phrase between cue points $c_0$ and $c_1$. The computed phrase boundaries $b_i$ include the cue points.
  • Figure 3: Pipeline of the proposed CUE-DETR architecture. During training, an input Mel spectrogram $S$ is segmented into training images $S_T$. Each $S_T$ consists of a spectrogram segment containing a cue point which is represented as a bounding box. Inference images $S_I$ move across $S$ using a sliding window. The predicted bounding boxes are converted to their center $x$-coordinate. The highest scoring positions are selected greedily among all candidates with minimum confidence $t=0.9$. A selected position excludes all other candidates within a radius $r$. The bottom spectrogram shows the predicted positions as peaks based on the confidence value.
  • Figure 4: Predicted and ground-truth cue point positions shown over three Mel spectrograms of different random tracks from the evaluation split of EDM-CUE. The confidence score for each position is illustrated as the white curve. Magenta lines indicate correct model predictions, red lines indicate wrong model predictions. For reference, solid orange lines represent ground-truth positions and dashed orange lines illustrate 16-bar phrase boundaries.
  • Figure 5: Distribution of ground-truth cue point positions in blue and predicted cue point positions in orange quantized to bars. The cosine similarity between the predicted cue point positions and ground-truth is 0.425 (Automix), 0.371 (MIK), and 0.851 (CUE-DETR).
  • ...and 1 more figures
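
The post-processing step described in the Figure 3 caption (convert each predicted bounding box to its center $x$-coordinate, keep candidates with confidence of at least $t=0.9$, then pick the highest-scoring positions greedily while excluding neighbours within a radius $r$) can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the `(x_min, x_max, score)` box layout, the position unit, and the default value of `radius` are assumptions made for illustration only. Only the threshold $t=0.9$ comes from the caption.

```python
from typing import List, Tuple

Box = Tuple[float, float, float]   # (x_min, x_max, score); layout is assumed
Cue = Tuple[float, float]          # (position, score)


def box_center_x(x_min: float, x_max: float) -> float:
    """Reduce a bounding box to the x-coordinate of its center."""
    return (x_min + x_max) / 2.0


def select_cue_points(
    boxes: List[Box],
    min_confidence: float = 0.9,   # t = 0.9 as stated in the Figure 3 caption
    radius: float = 8.0,           # hypothetical value; r is not given in the text
) -> List[Cue]:
    """Greedily keep the highest-scoring positions.

    A selected position excludes every other candidate whose distance
    to it is at most `radius` (same unit as the box coordinates).
    """
    # Convert boxes to (center, score) and drop low-confidence candidates.
    candidates = [
        (box_center_x(x_min, x_max), score)
        for x_min, x_max, score in boxes
        if score >= min_confidence
    ]
    # Visit candidates from highest to lowest confidence.
    candidates.sort(key=lambda c: c[1], reverse=True)

    selected: List[Cue] = []
    for pos, score in candidates:
        # Keep the candidate only if no already-selected position is within the radius.
        if all(abs(pos - kept_pos) > radius for kept_pos, _ in selected):
            selected.append((pos, score))

    # Return the result in temporal order.
    return sorted(selected)


if __name__ == "__main__":
    example_boxes = [
        (9.0, 11.0, 0.95),    # center 10.0, suppressed by the stronger peak at 12.0
        (11.0, 13.0, 0.99),   # center 12.0, selected first
        (39.0, 41.0, 0.92),   # center 40.0, selected
        (40.0, 42.0, 0.50),   # below the confidence threshold
        (69.0, 71.0, 0.91),   # center 70.0, selected
    ]
    print(select_cue_points(example_boxes))
```

Because overlapping sliding-window images can produce several boxes for the same cue point, the radius-based exclusion acts as a one-dimensional form of non-maximum suppression: only the most confident position in each neighbourhood survives.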