CVSM: Contrastive Vocal Similarity Modeling

Christos Garoufis, Athanasia Zlatintsi, Petros Maragos

TL;DR

CVSM introduces a contrastive self-supervised framework for learning vocal-focused music representations that are robust to instrumental accompaniment. It employs two sampling schemes to shape a latent space that captures vocal timbre while remaining invariant to non-vocal content: a label-informed scheme, which pairs isolated vocals with mixtures from the same artist, and a label-agnostic scheme, which uses artificial mixtures. Empirical results show that label-informed CVSM variants generally provide more consistent performance in the downstream tasks and in the user study, though a hybrid label-agnostic approach (real plus artificial mixtures) can closely match label-informed performance in artist identification and perceived vocal similarity. The work advances vocal similarity modeling in realistic musical contexts and points to practical uses in vocal retrieval and similarity tasks without heavy reliance on metadata, while suggesting future integration of identity estimation into the sampling strategies.

Abstract

The availability of large, unlabeled datasets across various domains has contributed to the development of a plethora of methods that learn representations for multiple target (downstream) tasks through self-supervised pre-training. In this work, we introduce CVSM (Contrastive Vocal Similarity Modeling), a contrastive self-supervised procedure for music signal representation learning in the audio domain that can be utilized for musical and vocal similarity modeling. Our method operates under a contrastive framework, maximizing the similarity between vocal excerpts and musical mixtures containing the same vocals; we devise both a label-informed protocol, leveraging artist identity information to sample the contrastive pairs, and a label-agnostic scheme, involving artificial mixture creation from randomly sampled vocal and accompaniment excerpts, which are paired with vocals from the same audio segment. We evaluate our proposed method in measuring vocal similarity both objectively, through linear probing on a suite of appropriate downstream tasks, and subjectively, via conducting a user study consisting of pairwise comparisons between different models in a recommendation-by-query setting. Our results indicate that the representations learned through CVSM are effective in musical and vocal similarity modeling, outperforming numerous baselines across both isolated vocals and complete musical mixtures. Moreover, while the availability of artist identity labels during pre-training leads to overall more consistent performance both in the evaluated downstream tasks and the user study, a label-agnostic CVSM variant incorporating hybrid pre-training with real and artificial mixtures achieves comparable performance to the label-informed one in artist identification and perceived vocal similarity.
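
The contrastive objective described in the abstract, which pulls each vocal excerpt toward the mixture containing the same vocals, can be illustrated with a minimal sketch. This section does not state the exact loss, so the InfoNCE-style formulation, the function name `info_nce_loss`, and the `temperature` value below are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def info_nce_loss(vocal_emb, mixture_emb, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of `vocal_emb` and row i of `mixture_emb` form a positive pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    v = vocal_emb / np.linalg.norm(vocal_emb, axis=1, keepdims=True)
    m = mixture_emb / np.linalg.norm(mixture_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (batch, batch), scaled by temperature.
    logits = (v @ m.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positive pairs sit on the diagonal; average their negative log-probability.
    return float(-np.mean(np.diag(log_probs)))


# Usage: stand-in embeddings (random vectors, not real encoder outputs).
rng = np.random.default_rng(0)
vocal = rng.normal(size=(8, 16))
mixture = vocal + 0.01 * rng.normal(size=(8, 16))
loss = info_nce_loss(vocal, mixture)
```

In this sketch the loss is small when each vocal embedding is closest to its own mixture embedding and grows when the pairing is scrambled, which is the behaviour a contrastive pre-training objective relies on.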

Paper Structure

This paper contains 18 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of our proposed framework for learning audio representations. Contrastive pairs are generated either using (top left) label-informed sampling, where pairs of musical mixtures (consisting of vocals and instrumental accompaniment) and isolated vocals are sampled from the same artist, or (bottom left) in a label-agnostic manner, by i) creating artificial song mixtures by superimposing the vocals and accompaniments of different song excerpts or ii) sampling excerpts from the complete song, and coupling them with time-shifted excerpts of the vocals. These contrastive pairs are then used to pre-train an encoder backbone with a contrastive loss objective (right).
  • Figure 2: Dataset statistics for Music4All: the number of artist identities (left) and the percentage of audio previews in the dataset (right), grouped according to the number of audio previews available for each artist.
  • Figure 3: Performance on the tasks of artist identification (left) and gender identification (right), depending on the length of input context available to the network (in sec).
  • Figure 4: Performance of the obtained frozen embeddings on the task of artist identification, subject to a reduced data regime, when using the full mixture (left) or the vocal excerpts (right) as network input.
  • Figure 5: t-SNE projections of clip-wise average embeddings from various models; blue dots correspond to male singers, orange to female. The top row plots correspond to CVSM-A (top left) and CVSM-AH (top right) variants, the middle row to the label-agnostic baselines COLA (middle left) and MSCOL (middle right), and the bottom row to the label-informed models, COLA-ART (bottom left) and CVSM-ART (bottom right).
  • ...and 3 more figures
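
The label-agnostic artificial-mixture sampling described in the Figure 1 caption can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `make_artificial_pairs`, the array layout, and the cyclic-shift pairing are hypothetical, and details such as gain normalisation or time-shifting of the vocal excerpt are omitted because this section does not specify them.

```python
import numpy as np


def make_artificial_pairs(vocals, accompaniments, rng):
    """Build (artificial mixture, vocal) contrastive pairs.

    `vocals` and `accompaniments` are (n_excerpts, n_samples) arrays holding
    the separated stems of n excerpts, with n >= 2. Each vocal stem is
    superimposed on the accompaniment of a *different* excerpt.
    """
    n = vocals.shape[0]

    # A cyclic shift by a random non-zero offset guarantees that no vocal
    # is paired with its own accompaniment (illustrative choice).
    offset = rng.integers(1, n)
    idx = (np.arange(n) + offset) % n

    # Superimpose the waveforms to form the artificial mixtures.
    mixtures = vocals + accompaniments[idx]

    # Mixture i is the positive for vocal i, since it contains that vocal.
    return mixtures, vocals, idx


# Usage with random stand-in waveforms (not real audio).
rng = np.random.default_rng(0)
vocals = rng.normal(size=(4, 100))
accompaniments = rng.normal(size=(4, 100))
mixtures, positives, idx = make_artificial_pairs(vocals, accompaniments, rng)
```

Because the accompaniment in each artificial mixture comes from an unrelated excerpt, the only content shared within a pair is the vocal, which is what encourages invariance to non-vocal content.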