Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model

Julia Barnett, Hugo Flores Garcia, Bryan Pardo

TL;DR

This paper tackles training-data attribution for generative music by proposing a replicable framework that identifies influential training data via audio embeddings. It applies CLAP and CLMR embeddings to measure cosine similarity between 3-second clips from generated outputs and a large training corpus (5 million clips), validating the approach with a human ABX listening study. The authors demonstrate robustness to perturbations and reveal that generated pieces (vamps) often resemble training data more than their seed prompts, enabling informed attribution and potential compensatory mechanisms for creators. Overall, the framework advances responsible, data-grounded understanding of model influence and provides a practical tool for model creators and users to scrutinize and contextualize generated music.

Abstract

Every artist has a creative process that draws inspiration from previous artists and their works. Today, "inspiration" has been automated by generative music models. The black box nature of these models obscures the identity of the works that influence their creative output. As a result, users may inadvertently appropriate, misuse, or copy existing artists' works. We establish a replicable methodology to systematically identify similar pieces of music audio in a manner that is useful for understanding training data attribution. A key aspect of our approach is to harness an effective music audio similarity measure. We compare the effect of applying CLMR and CLAP embeddings to similarity measurement in a set of 5 million audio clips used to train VampNet, a recent open source generative music model. We validate this approach with a human listening study. We also explore the effect that modifications of an audio example (e.g., pitch shifting, time stretching, background noise) have on similarity measurements. This work is foundational to incorporating automated influence attribution into generative modeling, which promises to let model creators and users move from ignorant appropriation to informed creation. Audio samples that accompany this paper are available at https://tinyurl.com/exploring-musical-roots.
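
The similarity search described above can be illustrated with a minimal sketch. It assumes each clip has already been mapped to a fixed-length embedding vector (for example by CLAP or CLMR, which are not invoked here) and ranks training clips by cosine similarity to a generated clip. The function names, clip identifiers, and toy 3-dimensional vectors are hypothetical and chosen only for illustration; this is not the authors' implementation, and a corpus of 5 million clips would in practice call for a vectorized or approximate nearest-neighbor search rather than this linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k corpus clips most similar to the query, best first."""
    scored = [(clip_id, cosine_similarity(query, emb)) for clip_id, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real audio embeddings have many more dimensions.
training_clips = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.9, 0.1, 0.0],
}
generated_clip = [1.0, 0.05, 0.0]

ranking = top_k_similar(generated_clip, training_clips, k=2)
for clip_id, score in ranking:
    print(f"{clip_id}: {score:.4f}")
```

Cosine similarity depends only on the angle between two vectors, not their magnitudes, so two clips count as similar when their embeddings point in nearly the same direction.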

Paper Structure (27 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Plots of various amounts of perturbation applied to clips and the percentage of the time the clips were returned among the top $k=10$, $k=5$, and $k=1$ songs using our methodology, analyzed for both CLAP (left column) and CLMR (right column) embeddings. From top to bottom, the rows display pitch shift in semitones, time stretch as percent shortened/elongated, white noise overlay in decibels relative to the target clip, and mash-ups of two songs in the training data, one song in the training data and one random song, and a prompt song and its generated vamp.
  • Figure 2: Example question participants had in our subjective evaluation.