Measuring Audio Prompt Adherence with Distribution-based Embedding Distances
Maarten Grachten
TL;DR
This work tackles the lack of a universal metric for evaluating how well generated music adheres to an audio prompt. It proposes a distribution-based framework that combines embedding-space distances (FAD and MMD) with a set of constituents (embedding models, PCA projections, and fusion strategies) to quantify audio prompt adherence. This is complemented by an explicit adherence score $S^{(\mathcal{M})}_{X}(Y)$ that normalizes the distance to a matching reference against the distance to a non-matching one. Through three experiments, the paper shows that applying FAD/MMD directly to fused embeddings is not robust across music collections, but that a relative-score approach and early fusion with CLAP embeddings (especially with PCA100 whitening) are strongly sensitive to adherence and to perturbations (pitch and time shifts). The findings highlight cross-collection robustness as a key challenge and offer a practical, open-source implementation for evaluating audio-prompt-conditioned music generation, with implications for model development and benchmarking. The work contributes a generic, instrument-agnostic metric framework and demonstrates its behavior under controlled perturbations, while outlining directions for improving robustness to acoustic artifacts and loudness effects.
Abstract
An increasing number of generative music models can be conditioned on an audio prompt that serves as the musical context for which the model is to create an accompaniment (often further specified using a text prompt). How well model outputs adhere to the audio prompt is often evaluated in a model- or problem-specific manner, presumably because no generic evaluation method for audio prompt adherence has emerged. Such a method could be useful both in the development and training of new models and for making performance comparable across models. In this paper we investigate whether commonly used distribution-based distances, such as the Fréchet Audio Distance (FAD), can be used to measure audio prompt adherence. We propose a simple procedure based on a small number of constituents (an embedding model, a projection, an embedding distance, and a data fusion method), which we systematically assess using a baseline validation. In a follow-up experiment we test the sensitivity of the proposed audio prompt adherence measure to pitch and time shift perturbations. The results show that the proposed measure is sensitive to such perturbations, even when the reference and candidate distributions come from different music collections. More experimentation is needed to answer open questions, such as how robust the measure is to acoustic artifacts that do not affect audio prompt adherence. Nevertheless, the current results suggest that distribution-based embedding distances provide a viable way of measuring audio prompt adherence. A Python/PyTorch implementation of the proposed measure is publicly available as a GitHub repository.
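To make the idea concrete, the following is a minimal sketch of a Fréchet distance computed on early-fused (concatenated) embeddings of context and accompaniment. It is not the repository's implementation. The function names `frechet_distance` and `early_fusion` are hypothetical, random vectors stand in for the output of an embedding model, and the PCA projection step is omitted. Non-matching pairs are simulated here by shuffling the accompaniments, which is an assumption made for illustration.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to the rows of x and y."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # tr(sqrt(cov_x cov_y)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def early_fusion(context_emb, stem_emb):
    """Hypothetical early fusion: concatenate paired embeddings per item."""
    return np.concatenate([context_emb, stem_emb], axis=1)


# Synthetic stand-ins for embeddings (assumption: a real setup would use an
# audio embedding model such as CLAP instead of random vectors).
rng = np.random.default_rng(0)
n, d = 2000, 8
w = rng.normal(size=(d, d))  # makes each stem depend on its own context

ref_context = rng.normal(size=(n, d))
ref_stem = ref_context @ w + 0.1 * rng.normal(size=(n, d))
reference = early_fusion(ref_context, ref_stem)

cand_context = rng.normal(size=(n, d))
cand_stem = cand_context @ w + 0.1 * rng.normal(size=(n, d))
matching = early_fusion(cand_context, cand_stem)
shuffled = early_fusion(cand_context, cand_stem[rng.permutation(n)])

fd_matching = frechet_distance(reference, matching)
fd_shuffled = frechet_distance(reference, shuffled)
print(f"matching pairs: {fd_matching:.3f}")
print(f"shuffled pairs: {fd_shuffled:.3f}")
```

In this toy setting, shuffling leaves the marginal distributions of contexts and accompaniments unchanged and only destroys their pairing. Concatenating the embeddings before fitting the Gaussians exposes that loss through the cross-covariance terms, which is why the distance for shuffled pairs comes out larger than for matching pairs.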
