Are we describing the same sound? An analysis of word embedding spaces of expressive piano performance

Silvan David Peter, Shreyan Chowdhury, Carlos Eduardo Cancino-Chacón, Gerhard Widmer

TL;DR

This work examines whether general semantic embeddings can capture a fine-grained, domain-specific adjectival space describing expressive piano performance. It compares five embedding models (ADA, CLAP, EWE, GTE, BGE) against a ground-truth similarity structure derived from the Con Espressione Dataset and expert pile sorting, and evaluates the effects of context prompts, hubness reduction, cross-modal text-audio alignment, and clustering. The key finding is that general-purpose embeddings can approach human inter-rater agreement in this domain, though performance is highly model-dependent and cross-modal or domain-adapted approaches do not consistently outperform general models; hubness mitigation and contextual prompts can improve alignment. Practically, the results inform music information retrieval by highlighting the strengths and limits of current embeddings for domain-specific expressive language, and the study provides reproducible resources for further exploration of fine-grained lexical spaces across domains.

Abstract

Semantic embeddings play a crucial role in natural language-based information retrieval. Embedding models represent words and contexts as vectors whose spatial configuration is derived from the distribution of words in large text corpora. While such representations are generally very powerful, they might fail to account for fine-grained domain-specific nuances. In this article, we investigate this uncertainty for the domain of characterizations of expressive piano performance. Using a music research dataset of free text performance characterizations and a follow-up study sorting the annotations into clusters, we derive a ground truth for a domain-specific semantic similarity structure. We test five embedding models and their similarity structure for correspondence with the ground truth. We further assess the effects of contextualizing prompts, hubness reduction, cross-modal similarity, and k-means clustering. The quality of embedding models shows great variability with respect to this task; more general models perform better than domain-adapted ones and the best model configurations reach human-level agreement.
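
As a minimal sketch of what "similarity structure" and "hubness" refer to here, the following Python computes pairwise cosine similarities for a few toy vectors and counts how often each item appears in the k-nearest-neighbor lists of the other items (its k-occurrence). The vectors and all function names are illustrative assumptions, not data or code from the study; a strongly skewed k-occurrence distribution is one common symptom of hubness.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Full pairwise cosine similarity matrix as a list of lists."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def k_nearest(sim, i, k):
    """Indices of the k items most similar to item i, excluding i itself."""
    others = [j for j in range(len(sim)) if j != i]
    others.sort(key=lambda j: sim[i][j], reverse=True)
    return others[:k]


def k_occurrence(sim, k):
    """How often each item occurs in the k-nearest-neighbor lists of others."""
    counts = [0] * len(sim)
    for i in range(len(sim)):
        for j in k_nearest(sim, i, k):
            counts[j] += 1
    return counts


# Hypothetical 2-D "embeddings" standing in for term vectors.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.6, 0.6]]
sim = similarity_matrix(toy_vectors)
counts = k_occurrence(sim, k=2)
print(counts)
```

Every item contributes exactly k neighbors, so the counts always sum to n times k; hubness shows up in how unevenly that total is distributed across items.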

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Multidimensional Scaling (MDS) of the term data based on the equally weighted pile group one, pile group two, and performance similarities. The legend on the left lists all piles of both groups, first the group number, then the names the musicians assigned them. Each of the 150 terms in our ground truth data is shown in the scatter plot to the right and colored by the two piles it was sorted into, one for group one (large dots), one for group two (small dots). The musicians did not rate any similarities between piles, so the color progressions for the piles do not encode closeness.
  • Figure 2: Box plots of distributions of pairwise similarities in various embedding spaces. The main embedding models tested are labelled as EWE, CLAP, ADA, GTE, and BGE [li2023general; OpenAI_2022; BAAI_2022; elizalde2023clap; agrawal2018learning]. The three leftmost distributions relate to cross-modal audio and text embeddings as discussed in Section \ref{sec:audio}, and the distributions labeled with "context" are addressed in Section \ref{sec:context}.
  • Figure 3: Left plot: aP@k for $k \in \{1, \dots, 49\}$ for several embedding models against the ground truth similarities. Right plot: aP@k of 45 performance embeddings represented as CLAP audio embeddings and as mean CLAP text embeddings of terms (with and without context prompts).
  • Figure 4: Left plot: relative change in aP@k brought about by the inclusion of contextualizing prompts. Right plot: relative change in aP@k due to hubness reduction at neighborhoods of size eight.
  • Figure 5: Visualization of the convex hull of terms of each pile as embedded in the ground truth data. The MDS dimension reduction is for illustrative purposes only; 8+ dimensions are required to represent the space with minimal loss of information (<10% reduction in aP@k against original data). The pile centers are shown as average term embedding positions and annotated with the pile names given by the musicians.
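
The captions above use aP@k without defining it. Assuming it denotes precision at k averaged over all items, that is, the average fraction of an item's k nearest neighbors in an embedding space that are also among its k nearest neighbors in the ground truth, a minimal Python sketch could look as follows. This reading of the measure, the function names, and the toy similarity matrices are assumptions for illustration only; the paper's exact formulation may differ.

```python
def top_k(sim_row, i, k):
    """Set of the k indices most similar to item i, excluding i itself."""
    others = [j for j in range(len(sim_row)) if j != i]
    others.sort(key=lambda j: sim_row[j], reverse=True)
    return set(others[:k])


def mean_precision_at_k(sim_model, sim_truth, k):
    """Average overlap of k-nearest-neighbor sets between two similarity matrices."""
    n = len(sim_truth)
    total = 0.0
    for i in range(n):
        overlap = top_k(sim_model[i], i, k) & top_k(sim_truth[i], i, k)
        total += len(overlap) / k
    return total / n


# Hypothetical ground-truth similarities for four terms.
truth = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

# Hypothetical model similarities for the same four terms.
model = [
    [1.0, 0.7, 0.6, 0.1],
    [0.7, 1.0, 0.2, 0.8],
    [0.6, 0.2, 1.0, 0.5],
    [0.1, 0.8, 0.5, 1.0],
]

print(mean_precision_at_k(model, truth, k=1))  # → 0.25
print(mean_precision_at_k(model, truth, k=3))  # → 1.0
```

With k equal to the number of other items, every neighbor set contains all remaining items and the measure is trivially 1.0, which is why curves over k are more informative than a single value.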