COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations

Ruben Ciranni, Giorgio Mariani, Michele Mancusi, Emilian Postolache, Giorgio Fabbro, Emanuele Rodolà, Luca Cosmo

TL;DR

COCOLA addresses the lack of objective coherence metrics for stem-level music accompaniment by introducing a coherence-oriented contrastive objective and the COCOLA Score. The method trains a convolutional encoder to maximize agreement between disjoint stem submixes within the same audio window while pushing apart submixes from different windows, with a bilinear similarity $\text{sim}(\mathbf{h}_1,\mathbf{h}_2)=\mathbf{h}_1^\top \mathbf{W} \mathbf{h}_2$ and cross-entropy loss. It further evaluates a factorized input variant using Harmonic-Percussive Separation to disentangle harmony and rhythm contributions. Empirical results on four stem-separated datasets show high coherent-submix classification accuracy and a meaningful correlation with human judgments (MOS), enabling objective benchmarking of accompaniment-generation models and providing checkpoints for reproducibility. The work thus provides a practical, scalable tool for measuring coherence in conditional music generation.
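The objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `bilinear_contrastive_loss`, the batch layout (row $i$ of each matrix comes from the same window), and the one-directional form of the loss are assumptions made for the example. Only the bilinear similarity $\mathbf{h}_1^\top \mathbf{W} \mathbf{h}_2$ and the use of cross-entropy come from the summary.

```python
import numpy as np

def bilinear_contrastive_loss(h1, h2, W):
    """Cross-entropy over bilinear similarities (illustrative sketch).

    h1, h2: arrays of shape (K, D); row i of h1 and row i of h2 are assumed
            to be embeddings of two disjoint submixes from the same window.
    W:      array of shape (D, D), the bilinear similarity matrix.
    """
    # logits[i, j] = h1[i]^T W h2[j]
    logits = h1 @ W @ h2.T
    # Subtract the row maximum before exponentiating, for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal; other columns are negatives.
    return -np.mean(np.diag(log_probs))

# Toy check with unit-norm embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
K, D = 4, 8
h = rng.normal(size=(K, D))
h /= np.linalg.norm(h, axis=1, keepdims=True)
W = 5.0 * np.eye(D)

loss_matched = bilinear_contrastive_loss(h, h, W)                        # aligned pairs
loss_shuffled = bilinear_contrastive_loss(h, np.roll(h, 1, axis=0), W)   # misaligned pairs
```

With aligned pairs the diagonal carries the largest similarity in each row, so `loss_matched` comes out lower than `loss_shuffled`. In training, $\mathbf{W}$ and the encoder parameters would be learned jointly rather than fixed as here.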

Abstract

We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples. Our method operates at the level of the stems composing music tracks and can input features obtained via Harmonic-Percussive Separation (HPS). COCOLA allows the objective evaluation of generative models for music accompaniment generation, which are difficult to benchmark with established metrics. In this regard, we evaluate recent music accompaniment generation models, demonstrating the effectiveness of the proposed method. We release the model checkpoints trained on public datasets containing separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).

Paper Structure (15 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Illustration of COCOLA Score. COCOLA is a contrastive model that estimates the coherence between instrumental tracks and generated accompaniments.
  • Figure 2: The COCOLA training procedure (single stem case). Windows of size $L$ are randomly cropped from $K$ tracks (left). Two distinct stems per window are randomly selected (e.g., Guitar $\mathbf{x}^1_1$ and Drums $\mathbf{x}^1_3$ in the first window), embedded using the COCOLA encoder $f_\theta$, yielding latent representations (e.g., $\mathbf{h}^1_1$ and $\mathbf{h}^1_2$). Contrastive loss (Eq. \ref{eq:cross_entropy}) is computed with positive pairs within windows and negatives across windows.
  • Figure 3: Correlation plot between subjective scores (MOS) ($x$-axis) and COCOLA Scores ($y$-axis).
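
The sampling step in the Figure 2 caption (crop one window per track, then pick two distinct stems inside it) can be sketched as follows. The function name `sample_positive_pairs` and the array layout `(num_stems, num_samples)` are hypothetical choices for this example. Only the single stem case is shown; the submix case and the encoder $f_\theta$ are not covered.

```python
import numpy as np

def sample_positive_pairs(tracks, window, rng):
    """Draw one positive pair per track (illustrative sketch).

    tracks: list of K arrays, each shaped (num_stems, num_samples), holding
            the time-aligned stems of one track (assumed layout).
    window: window length L in samples.
    rng:    a numpy.random.Generator.
    Returns two arrays of shape (K, window): anchors and positives.
    """
    anchors, positives = [], []
    for stems in tracks:
        num_stems, num_samples = stems.shape
        # Random crop position shared by both stems, so they stay time-aligned.
        start = rng.integers(0, num_samples - window + 1)
        # Two distinct stems from the same window form a positive pair.
        i, j = rng.choice(num_stems, size=2, replace=False)
        anchors.append(stems[i, start:start + window])
        positives.append(stems[j, start:start + window])
    return np.stack(anchors), np.stack(positives)

# Toy data: stem s of track k is filled with the constant 100*k + s.
rng = np.random.default_rng(0)
toy_tracks = [
    np.tile((100 * k + np.arange(n_stems))[:, None], (1, 50)).astype(float)
    for k, n_stems in enumerate([3, 4, 2])
]
anchors, positives = sample_positive_pairs(toy_tracks, window=16, rng=rng)
```

Row $k$ of `anchors` and row $k$ of `positives` form the positive pair for window $k$, and every cross-window combination then serves as a negative when the contrastive loss is computed.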