Measuring Audio Prompt Adherence with Distribution-based Embedding Distances

Maarten Grachten

TL;DR

This work tackles the lack of a generic metric for evaluating how well generated music adheres to an audio prompt. It proposes a distribution-based framework that combines embedding-space distances (FAD and MMD) with a small set of constituents (embedding models, PCA projections, and fusion strategies) to quantify audio prompt adherence, complemented by an explicit adherence score $S^{(\mathcal{M})}_{X}(Y)$ that normalizes the distance to a matching reference against the distance to a non-matching one. Through three experiments, the author shows that applying FAD/MMD directly to fused embeddings is not robust across music collections, but that the relative score combined with early fusion of CLAP embeddings (especially with PCA100 whitening) is sensitive both to adherence and to pitch and time shift perturbations. The findings identify cross-collection robustness as a key challenge and offer a practical, open-source implementation for evaluating audio-prompt-conditioned music generation, with implications for model development and benchmarking. The work contributes a generic, instrument-agnostic metric framework and characterizes its behavior under controlled perturbations, while outlining directions for improving robustness to acoustic artifacts and loudness effects.
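The idea of scoring a candidate set relative to a matching and a non-matching reference can be illustrated with a minimal sketch. It is not the paper's definition of $S^{(\mathcal{M})}_{X}(Y)$, which is not reproduced in this summary. The Gaussian kernel and its bandwidth, the function names `mmd2_rbf` and `relative_adherence`, the particular normalization, and the random arrays standing in for embeddings are all assumptions made for illustration.

```python
import numpy as np


def mmd2_rbf(x, y, bandwidth=2.0):
    """Biased estimate of the squared MMD with a Gaussian kernel (kernel choice is an assumption)."""
    def kernel(a, b):
        sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())


def relative_adherence(d_match, d_nonmatch, eps=1e-12):
    """Hypothetical normalization: positive when the candidate set is closer
    to the matching reference than to the non-matching reference."""
    return (d_nonmatch - d_match) / (d_nonmatch + d_match + eps)


rng = np.random.default_rng(0)
# Random stand-ins for embedding sets; real use would embed audio with a model such as CLAP.
ref_match = rng.normal(0.0, 1.0, size=(200, 4))     # matching reference X
ref_nonmatch = rng.normal(1.5, 1.0, size=(200, 4))  # non-matching reference X'
candidate = rng.normal(0.0, 1.0, size=(200, 4))     # candidate set Y

score = relative_adherence(
    mmd2_rbf(ref_match, candidate),
    mmd2_rbf(ref_nonmatch, candidate),
)
print(f"relative adherence score: {score:.3f}")
```

Under this assumed normalization the score is bounded by 1 in absolute value, so values from different distance metrics land on a comparable scale.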

Abstract

An increasing number of generative music models can be conditioned on an audio prompt that serves as musical context for which the model is to create an accompaniment (often further specified using a text prompt). Evaluation of how well model outputs adhere to the audio prompt is often done in a model- or problem-specific manner, presumably because no generic evaluation method for audio prompt adherence has emerged. Such a method could be useful both in the development and training of new models, and to make performance comparable across models. In this paper we investigate whether commonly used distribution-based distances such as Fréchet Audio Distance (FAD) can be used to measure audio prompt adherence. We propose a simple procedure based on a small number of constituents (an embedding model, a projection, an embedding distance, and a data fusion method) that we systematically assess using a baseline validation. In a follow-up experiment we test the sensitivity of the proposed audio adherence measure to pitch and time shift perturbations. The results show that the proposed measure is sensitive to such perturbations, even when the reference and candidate distributions are from different music collections. Although more experimentation is needed to answer open questions, such as how robust the measure is to acoustic artifacts that do not affect audio prompt adherence, the current results suggest that distribution-based embedding distances provide a viable way of measuring audio prompt adherence. A Python/PyTorch implementation of the proposed measure is publicly available as a GitHub repository.
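The four constituents named in the abstract (embedding, projection, distance, fusion) can be sketched roughly as follows. This is an illustrative NumPy sketch, not the author's published implementation: the function names are hypothetical, the embeddings are synthetic random arrays rather than outputs of a real embedding model, and early fusion is assumed here to mean concatenating the mix and stem embeddings of each pair.

```python
import numpy as np


def early_fusion(mix_emb, stem_emb):
    """Assumed early fusion: concatenate mix and stem embeddings per pair."""
    return np.concatenate([mix_emb, stem_emb], axis=1)


def fit_pca_whitening(ref, n_components):
    """Fit a PCA whitening projection on the reference set and return it as a function."""
    mean = ref.mean(axis=0)
    _, s, vt = np.linalg.svd(ref - mean, full_matrices=False)
    scale = s[:n_components] / np.sqrt(len(ref) - 1)  # per-component std dev
    return lambda x: (x - mean) @ vt[:n_components].T / scale


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^(1/2)) equals the sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def synthetic_pairs(rng, n=500, dim=8):
    """Stand-in data: the stem embedding is a noisy copy of its mix embedding."""
    mix = rng.normal(size=(n, dim))
    return mix, mix + 0.1 * rng.normal(size=(n, dim))


rng = np.random.default_rng(0)
ref_mix, ref_stem = synthetic_pairs(rng)
cand_mix, cand_stem = synthetic_pairs(rng)

reference = early_fusion(ref_mix, ref_stem)
matching = early_fusion(cand_mix, cand_stem)
# Non-matching candidates: pair every mix with a stem from a different item.
non_matching = early_fusion(cand_mix, rng.permutation(cand_stem))

project = fit_pca_whitening(reference, n_components=8)
d_match = frechet_distance(project(reference), project(matching))
d_nonmatch = frechet_distance(project(reference), project(non_matching))
print(f"matching: {d_match:.3f}  non-matching: {d_nonmatch:.3f}")
```

In this toy setting, shuffling the stems leaves each marginal distribution unchanged and only alters the joint distribution of the fused embeddings, so the distance rises only because the pairing is broken. That is the property a distribution-based measure of prompt adherence relies on.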

Paper Structure (20 sections, 3 equations, 4 figures, 5 tables)

Introduction
Related Work
Measures for audio quality
Measures for text prompt adherence
Other evaluation measures for conditional music generation
Method
Baseline evaluation
Data collections
Distance metrics
Embedding models
PCA projection
Prompt/stem fusion method
Experiment 1: Baseline exploration
Procedure
Data processing
...and 5 more sections

Figures (4)

Figure 1: Within-collection (upper three rows) and between-collection (bottom row) FAD/$\textrm{MMD}$ distances of candidate sets to reference sets using different fusion methods, embedders, and projections. For between-collection distances, only the MIX fusion method is shown. The mix/stem pairs of the candidate sets are either matching (blue) or non-matching (orange). Asterisks denote the statistical significance of differences.
Figure 2: Hypothetical constellation of matching and non-matching reference ($X$, $X'$) and candidate sets ($Y$, $Y'$) for between-collection comparisons
Figure 3: [Experiment 2] Within-collection (top) and between-collection (bottom) FAD/$\textrm{MMD}$-based prompt adherence scores $S^{(\mathcal{M})}$ of matching vs non-matching candidate sets
Figure 4: [Experiment 3] Common language effect size (CLES) of different non-matching conditions on audio prompt adherence score.
