CVSM: Contrastive Vocal Similarity Modeling

Christos Garoufis; Athanasia Zlatintsi; Petros Maragos

TL;DR

CVSM is a contrastive self-supervised framework for learning vocal-focused music representations that are robust to instrumental accompaniment. It uses two sampling schemes to shape a latent space that captures vocal timbre while remaining invariant to non-vocal content: a label-informed scheme, which pairs isolated vocals with mixtures from the same artist, and a label-agnostic scheme, which uses artificial mixtures. Empirical results show that label-informed CVSM variants generally perform more consistently in both the downstream tasks and the user study, though a hybrid label-agnostic variant (pre-trained on real plus artificial mixtures) can closely match label-informed performance in artist identification and perceived vocal similarity. The work advances vocal similarity modeling in realistic musical contexts and points to practical uses in vocal retrieval and similarity tasks without heavy reliance on metadata; a suggested direction for future work is integrating identity estimation into the sampling strategy.

Abstract

The availability of large, unlabeled datasets across various domains has contributed to the development of a plethora of methods that learn representations for multiple target (downstream) tasks through self-supervised pre-training. In this work, we introduce CVSM (Contrastive Vocal Similarity Modeling), a contrastive self-supervised procedure for music signal representation learning in the audio domain that can be utilized for musical and vocal similarity modeling. Our method operates under a contrastive framework, maximizing the similarity between vocal excerpts and musical mixtures containing the same vocals; we devise both a label-informed protocol, which leverages artist identity information to sample the contrastive pairs, and a label-agnostic scheme, which creates artificial mixtures from randomly sampled vocal and accompaniment excerpts and pairs them with vocals from the same audio segment. We evaluate how well the proposed method measures vocal similarity both objectively, through linear probing on a suite of appropriate downstream tasks, and subjectively, through a user study consisting of pairwise comparisons between different models in a recommendation-by-query setting. Our results indicate that the representations learned through CVSM are effective in musical and vocal similarity modeling, outperforming numerous baselines across both isolated vocals and complete musical mixtures. Moreover, while the availability of artist identity labels during pre-training leads to overall more consistent performance in both the evaluated downstream tasks and the user study, a label-agnostic CVSM variant incorporating hybrid pre-training with real and artificial mixtures achieves performance comparable to the label-informed one in artist identification and perceived vocal similarity.
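The label-agnostic pairing and the contrastive objective described above can be illustrated with a minimal sketch. The abstract does not specify the loss, the mixing procedure, or the encoder, so everything here is an assumption made for illustration: a one-directional InfoNCE-style loss over cosine similarities, a simple additive mixture with peak normalization, and the hypothetical names `make_artificial_mixture`, `info_nce`, `gain`, and `temperature`. Embeddings are taken as given arrays; no encoder is modeled.

```python
import numpy as np


def make_artificial_mixture(vocals, accompaniment, gain=1.0):
    """Add a vocal excerpt to an unrelated accompaniment excerpt.

    Assumption: both inputs are mono waveforms of equal length.
    The sum is rescaled only if it would exceed the [-1, 1] range.
    """
    mix = vocals + gain * accompaniment
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix


def info_nce(vocal_emb, mixture_emb, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's objective).

    Row i of `vocal_emb` and row i of `mixture_emb` form a positive pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = vocal_emb / np.linalg.norm(vocal_emb, axis=1, keepdims=True)
    m = mixture_emb / np.linalg.norm(mixture_emb, axis=1, keepdims=True)
    logits = (v @ m.T) / temperature
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positives sit on the diagonal of the similarity matrix.
    return -np.mean(np.diag(log_prob))
```

Under these assumptions, minimizing `info_nce` pulls each vocal embedding toward the embedding of the mixture containing the same vocals and pushes it away from the other mixtures in the batch, which is the behavior the abstract describes. In the label-informed protocol, the positive mixture would instead be sampled using artist identity.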
