
Unsupervised Evaluation of Deep Audio Embeddings for Music Structure Analysis

Axel Marmoret

Abstract

Music Structure Analysis (MSA) aims to uncover the high-level organization of musical pieces. State-of-the-art methods are often based on supervised deep learning, but these methods are bottlenecked by the need for heavily annotated data and by inherent structural ambiguities. In this paper, we propose an unsupervised evaluation of nine open-source, generic pre-trained deep audio models on MSA. For each model, we extract barwise embeddings and segment them using three unsupervised segmentation algorithms (Foote's checkerboard kernels, spectral clustering, and Correlation Block-Matching (CBM)), focusing exclusively on boundary retrieval. Our results demonstrate that modern, generic deep embeddings generally outperform traditional spectrogram-based baselines, but not systematically. Furthermore, our unsupervised boundary estimation methodology generally yields stronger performance than recent linear probing baselines. Among the evaluated techniques, the CBM algorithm consistently emerges as the most effective downstream segmentation method. Finally, we highlight the artificial inflation of standard evaluation metrics and advocate for systematically "trimming", or even "double trimming", annotations to establish more rigorous MSA evaluation standards.
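The "trimming" advocated above discards the trivial first and last boundaries (the start and end of the piece, which every system predicts by construction) before scoring, since those trivial hits inflate hit-rate metrics. The following is a minimal sketch of a boundary F-measure with optional trimming; it uses a simplified greedy matcher (evaluation libraries such as `mir_eval` use optimal bipartite matching), and all names are illustrative, not the paper's implementation:

```python
def boundary_f_measure(ref, est, window=0.5, trim=False):
    """Hit-rate F-measure between reference and estimated boundaries (in seconds).

    With trim=True, the first and last boundaries (piece start/end, which
    every system predicts trivially) are dropped before matching.
    """
    ref, est = sorted(ref), sorted(est)
    if trim:
        ref, est = ref[1:-1], est[1:-1]
    if not ref or not est:
        return 0.0
    matched, used = 0, set()
    for r in ref:  # greedy one-to-one matching within the tolerance window
        for j, e in enumerate(est):
            if j not in used and abs(r - e) <= window:
                matched += 1
                used.add(j)
                break
    if matched == 0:
        return 0.0
    precision, recall = matched / len(est), matched / len(ref)
    return 2 * precision * recall / (precision + recall)


ref = [0.0, 10.0, 20.0, 30.0]   # annotated boundaries, incl. piece start/end
est = [0.0, 10.2, 30.0]         # an estimate that misses the boundary at 20 s
print(boundary_f_measure(ref, est, window=0.5))             # inflated by trivial hits
print(boundary_f_measure(ref, est, window=0.5, trim=True))  # stricter trimmed score
```

On this toy example the untrimmed score is about 0.857 while the trimmed score drops to about 0.667, illustrating the inflation the paper describes.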

Paper Structure

This paper contains 18 sections, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Comparison of the best results obtained with deep models and the Barwise TF features (non-deep learning baseline), according to the segmentation algorithm and the dataset. The hyperparameters of the downstream segmentation algorithms are selected as the best performing ones ($\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$ average) per model and dataset.
  • Figure 2: Best results obtained with all deep learning models, and their best downstream segmentation algorithm. Rows are ordered by decreasing average of $\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$. Superscript denotes the downstream segmentation algorithm used to obtain these results ($C$: CBM, $F$: Foote). The hyperparameters of the downstream segmentation algorithms are selected as the best performing ones ($\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$ average) per model and dataset.
  • Figure 3: Best results obtained with all deep learning models, using the CBM segmentation algorithm. Rows are ordered by decreasing average of $\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$. The hyperparameters of the CBM algorithm are selected as the best performing ones ($\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$ average) per model but across datasets.
  • Figure 4: Best results obtained with all deep learning models, using the Foote segmentation algorithm. Rows are ordered by decreasing average of $\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$. The hyperparameters of the Foote algorithm are selected as the best performing ones ($\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$ average) per model but across datasets.
  • Figure 5: Best results obtained with all deep learning models, using the LSD segmentation algorithm. Rows are ordered by decreasing average of $\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$. The hyperparameters of the LSD algorithm are selected as the best performing ones ($\mathrm{F_{0.5s}}$ and $\mathrm{F_{3s}}$ average) per model but across datasets.
  • ...and 4 more figures
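Foote's method, named among the segmentation algorithms in the captions above, locates boundaries by correlating a Gaussian-tapered checkerboard kernel along the main diagonal of a self-similarity matrix (SSM) and peak-picking the resulting novelty curve. A minimal sketch of the novelty computation, assuming a precomputed SSM (function names are illustrative, not the paper's implementation):

```python
import numpy as np

def checkerboard_kernel(size):
    # Gaussian-tapered checkerboard kernel: +1 in the two "same-segment"
    # quadrants, -1 in the two "cross-segment" quadrants (Foote, 2000).
    idx = np.arange(size) - size // 2 + 0.5
    sign = np.sign(idx)
    taper = np.exp(-((idx / (0.5 * size)) ** 2))
    return np.outer(sign, sign) * np.outer(taper, taper)

def novelty_curve(ssm, kernel_size=16):
    # Slide the kernel along the SSM diagonal; high values mark positions
    # where two homogeneous blocks meet, i.e. candidate segment boundaries.
    n = ssm.shape[0]
    kernel = checkerboard_kernel(kernel_size)
    half = kernel_size // 2
    padded = np.pad(ssm, half, mode="constant")
    return np.array(
        [np.sum(padded[i:i + kernel_size, i:i + kernel_size] * kernel)
         for i in range(n)]
    )

# Toy SSM with two homogeneous segments meeting at frame 10.
ssm = np.zeros((20, 20))
ssm[:10, :10] = 1.0
ssm[10:, 10:] = 1.0
print(int(np.argmax(novelty_curve(ssm))))  # novelty peaks at the boundary
```

Boundaries are then read off as local maxima of the curve above a threshold; the kernel size controls the temporal scale of the detected structure.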