Supervised contrastive learning from weakly-labeled audio segments for musical version matching

Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

TL;DR

This work tackles segment-based musical version matching under weak supervision by introducing CLEWS, which combines a segment distance reduction framework with a decoupled, supervised contrastive loss built on Euclidean geometry. The reduction maps pairwise segment-level distances to track-level scores, while the loss aligns positive pairs and disperses negative pairs to produce robust embeddings. Across two large datasets, CLEWS achieves state-of-the-art track-level results and substantially better segment-level results across query lengths, indicating strong generalization to partial matches. The approach, together with its ablations and hyper-parameter analysis, offers a blueprint for weakly-labeled contrastive learning beyond audio, with practical implications for music discovery, copyright management, and related domains.

Abstract

Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the nature of the available ground truth, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we not only achieve state-of-the-art results in the standard track-level evaluation, but also obtain breakthrough performance in a segment-level evaluation. We believe that, due to the generality of the challenges addressed here, the proposed methods may find utility in domains beyond audio or musical version matching.

Paper Structure

This paper contains 21 sections, 17 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Illustration of four reduction functions $\mathcal{R}$ over pairwise segment distances $\tilde{d}^{kl}_{ij}$. They are depicted on different sub-rectangles $\tilde{\textbf{D}}_{ij}$, where tracks $i$, $j$, and $j+1$ are versions (green squares) and track $i+1$ is not (orange squares). The four functions correspond to: $\mathcal{R}_\text{meanmin}$ (top left), $\mathcal{R}_\text{bpwr-3}$ (top right), $\mathcal{R}_\text{best-10}$ (bottom left), and $\mathcal{R}_\text{min}$ (bottom right). The $\mathcal{R}_\text{bpwr-3}$ strategy depicts its minimum/masking recursion in increasingly dark levels (green/purple cells). The sub-rectangles for $\mathcal{R}_\text{bpwr-3}$ and $\mathcal{R}_\text{min}$ also exemplify dealing with different lengths by masking (gray cells).
  • Figure 2: Segment-level evaluation with DVI-Test. NAR (left) and MAP (right) for different query segment lengths $\tau$ (notice the logarithmic axis for NAR). The shaded regions correspond to 95% confidence intervals (barely visible due to the size of DVI-Test). Comparable results for SHS-Test, and for an alternative evaluation protocol, are available in Appendix \ref{sec:app_results}.
  • Figure 3: Effect of hyper-parameters $\gamma$ (top) and $\varepsilon$ (bottom) on DVI-Valid. Shaded regions correspond to 95% confidence intervals, and the default value is highlighted with a square marker.
  • Figure 4: Plot of $\nabla^-$ as a function of the negative pair potential $e^{-\gamma d^2_{ij}}$ for different values of $\gamma$ and $\varepsilon$. From left to right, we show $\gamma=\{2,5,10\}$. From darker to lighter, colors correspond to $\varepsilon=\{10^{-8}, 10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\}$. Dash-dotted lines indicate each $\varepsilon$ value (notice that, in $\mathcal{L}$, $\varepsilon$ is compared to an average negative pair potential, hence placing $\varepsilon$ as a reference in the potential axis makes sense).
  • Figure 5: Segment-level evaluation with SHS-Test. NAR (left) and MAP (right) for different lengths of query segments $\tau$ (notice the logarithmic axis for NAR). The shaded regions correspond to 95% confidence intervals.
  • ...and 1 more figure
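
The reduction functions named in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the page gives only the function names and the "minimum/masking recursion" hint, so the definitions below are assumptions. Here $\mathcal{R}_\text{min}$ is taken as the smallest pairwise distance, $\mathcal{R}_\text{meanmin}$ as the mean of row-wise minima, $\mathcal{R}_\text{best-10}$ as the mean of the 10 smallest distances, and $\mathcal{R}_\text{bpwr-3}$ as the mean of 3 minima picked recursively while masking the row and column of each pick. The function names, the use of plain Euclidean distance, and the omission of length masking (the gray cells) are also choices made for this illustration.

```python
import numpy as np


def pairwise_segment_distances(a, b):
    """Euclidean distances between segment embeddings a (m, d) and b (n, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)  # shape (m, n)


def reduce_min(d):
    # Assumed R_min: the single best-matching segment pair.
    return float(np.min(d))


def reduce_meanmin(d):
    # Assumed R_meanmin: for each segment of track i (row), take its
    # closest segment in track j, then average over rows.
    return float(np.mean(np.min(d, axis=1)))


def reduce_best(d, r=10):
    # Assumed R_best-r: mean of the r smallest pairwise distances.
    flat = np.sort(d, axis=None)
    r = min(r, flat.size)
    return float(np.mean(flat[:r]))


def reduce_bpwr(d, r=3):
    # Assumed R_bpwr-r: pick the global minimum, mask its row and column,
    # and repeat r times, so that no segment is used twice.
    d = d.astype(float)  # astype returns a copy, the input stays intact
    r = min(r, *d.shape)
    picked = []
    for _ in range(r):
        k, l = np.unravel_index(np.argmin(d), d.shape)
        picked.append(d[k, l])
        d[k, :] = np.inf
        d[:, l] = np.inf
    return float(np.mean(picked))


# Toy 3x3 matrix of pairwise segment distances between two tracks.
toy = np.array([[1.0, 4.0, 9.0],
                [2.0, 0.5, 7.0],
                [6.0, 3.0, 8.0]])
track_scores = {
    "min": reduce_min(toy),
    "meanmin": reduce_meanmin(toy),
    "best-2": reduce_best(toy, r=2),
    "bpwr-3": reduce_bpwr(toy, r=3),
}
```

Under these assumed definitions, each function turns a matrix of segment-to-segment distances into one track-to-track value, which is what allows track-level version labels to supervise segment-level embeddings.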