COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations
Ruben Ciranni, Giorgio Mariani, Michele Mancusi, Emilian Postolache, Giorgio Fabbro, Emanuele Rodolà, Luca Cosmo
TL;DR
COCOLA addresses the lack of objective coherence metrics for stem-level music accompaniment by introducing a coherence-oriented contrastive objective and the COCOLA Score. The method trains a convolutional encoder to maximize agreement between disjoint stem submixes drawn from the same audio window while pushing apart submixes from different windows, using a bilinear similarity $\text{sim}(\mathbf{h}_1,\mathbf{h}_2)=\mathbf{h}_1^\top \mathbf{W} \mathbf{h}_2$ and a cross-entropy loss. It also evaluates a factorized input variant that uses Harmonic-Percussive Separation (HPS) to disentangle the contributions of harmony and rhythm. Empirical results on four stem-separated datasets show high accuracy in classifying coherent submixes and a meaningful correlation with human judgments, measured as mean opinion scores (MOS). The work thus provides a practical, scalable tool for objectively benchmarking accompaniment-generation models in conditional music generation, with model checkpoints released for reproducibility.
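The bilinear similarity and cross-entropy objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the one-directional form of the loss, the convention that row `i` of each embedding batch comes from the same audio window, and the fixed identity matrix standing in for the learned $\mathbf{W}$ are all assumptions made for illustration.

```python
import numpy as np

def bilinear_similarity(h1, h2, W):
    """Pairwise scores S[i, j] = h1[i]^T W h2[j] for two embedding batches."""
    return h1 @ W @ h2.T

def contrastive_cross_entropy(h1, h2, W):
    """Cross-entropy over each row of the score matrix.

    Assumption: h1[i] and h2[i] embed submixes from the same window
    (the positive pair); all other columns in row i act as negatives.
    """
    logits = bilinear_similarity(h1, h2, W)
    # Subtract the row maximum for numerical stability of the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The target class for row i is column i, i.e. the diagonal.
    return -np.mean(np.diag(log_probs))

# Toy embeddings: four windows, each with a distinct, well-separated embedding.
h1 = 3.0 * np.eye(4)
h2 = h1.copy()
W = np.eye(4)  # stand-in for the learned bilinear matrix (assumption)

# Matching rows give a low loss; misaligned rows give a high loss.
loss_coherent = contrastive_cross_entropy(h1, h2, W)
loss_shuffled = contrastive_cross_entropy(h1, np.roll(h2, 1, axis=0), W)
```

In this toy setup the loss is close to zero when the rows of `h1` and `h2` are aligned and large when `h2` is shifted by one row, which is the behavior the objective is meant to encourage.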
Abstract
We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples. Our method operates at the level of the stems composing music tracks and can take as input features obtained via Harmonic-Percussive Separation (HPS). COCOLA enables the objective evaluation of generative models for music accompaniment generation, which are difficult to benchmark with established metrics. To demonstrate the effectiveness of the proposed method, we use it to evaluate recent music accompaniment generation models. We release the model checkpoints trained on public datasets containing separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).
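As a rough illustration of what operating at the stem level can look like, the sketch below builds one pair of disjoint submixes from the stems of a single audio window. The sampling scheme (a random permutation split into two non-empty groups), the array layout `(num_stems, num_samples)`, and the name `disjoint_submixes` are hypothetical choices, not details taken from the paper.

```python
import numpy as np

def disjoint_submixes(stems, rng):
    """Sum two disjoint, non-empty groups of stems from one audio window.

    stems: array of shape (num_stems, num_samples), one row per stem.
    Returns both submixes and the stem indices used for each.
    """
    n = stems.shape[0]
    if n < 2:
        raise ValueError("need at least two stems to form disjoint submixes")
    order = rng.permutation(n)
    k1 = rng.integers(1, n)            # size of the first group: 1 .. n-1
    k2 = rng.integers(1, n - k1 + 1)   # size of the second group: 1 .. n-k1
    idx_a = order[:k1]
    idx_b = order[k1:k1 + k2]
    mix_a = stems[idx_a].sum(axis=0)
    mix_b = stems[idx_b].sum(axis=0)
    return mix_a, mix_b, idx_a, idx_b

rng = np.random.default_rng(0)
stems = rng.normal(size=(4, 16))  # toy data: 4 stems, 16 samples each
mix_a, mix_b, idx_a, idx_b = disjoint_submixes(stems, rng)
```

Two submixes produced this way from the same window would form a positive pair, while submixes taken from different windows would serve as negatives.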
