Expressivity-aware Music Performance Retrieval using Mid-level Perceptual Features and Emotion Word Embeddings
Shreyan Chowdhury, Gerhard Widmer
TL;DR
This work tackles expressivity-aware music performance retrieval, where a user describes the desired expressive character of a rendition. It replaces the standard text-audio embedding pipeline with a piano-domain mid-level audio feature encoder and an emotion-enriched text encoder, projecting both into a common space via a linear mapper to enable retrieval. On the Con Espressione dataset, the proposed approach substantially improves retrieval quality, achieving a mean reciprocal rank (MRR) of up to 0.61 and top-1 accuracy approaching 0.38, outperforming a baseline Music-Text Representation model. Importantly, the mid-level feature dimensions offer interpretable explanations of retrieval decisions, supporting explainability in music search and downstream recommendation tasks.
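For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how mean reciprocal rank and top-1 accuracy could be computed from a query-by-candidate similarity matrix. The function names, the matrix layout, and the tie-handling rule are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def reciprocal_ranks(similarity, true_index):
    """Reciprocal rank of the correct candidate for each query.

    similarity: array of shape (n_queries, n_candidates); higher = better match.
    true_index: array of shape (n_queries,) with the correct candidate per query.
    Both the layout and the names are assumptions made for this sketch.
    """
    similarity = np.asarray(similarity, dtype=float)
    true_index = np.asarray(true_index)
    true_scores = similarity[np.arange(len(true_index)), true_index]
    # Rank = 1 + number of candidates scoring strictly higher than the correct
    # one, so ties are resolved in favour of the correct candidate.
    ranks = 1 + (similarity > true_scores[:, None]).sum(axis=1)
    return 1.0 / ranks

def mrr_and_top1(similarity, true_index):
    """Return (mean reciprocal rank, top-1 accuracy) over all queries."""
    rr = reciprocal_ranks(similarity, true_index)
    return rr.mean(), (rr == 1.0).mean()
```

Under this definition, a query whose correct performance is ranked first contributes 1 to the MRR, one ranked second contributes 0.5, and so on; top-1 accuracy is the fraction of queries with rank 1.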
Abstract
This paper explores a specific sub-task of cross-modal music retrieval. We consider the delicate task of retrieving, from a set of different performances of the same musical piece, the performance or rendition that matches a description of its style, expressive character, or emotion. We observe that a general-purpose cross-modal system trained to learn a common text-audio embedding space does not yield optimal results for this task. By introducing two changes, one to the text encoder and one to the audio encoder, we demonstrate improved performance on a dataset of piano performances and associated free-text descriptions. On the text side, we use emotion-enriched word embeddings (EWE), and on the audio side, we extract mid-level perceptual features instead of generic audio embeddings. Our results highlight the effectiveness of mid-level perceptual features learnt from music and emotion-enriched word embeddings learnt from emotion-labelled text in capturing musical expression in a cross-modal setting. Additionally, our interpretable mid-level features provide a route for introducing explainability in the retrieval and downstream recommendation processes.
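The retrieval step described above can be illustrated with a small sketch. It assumes that text embeddings and mid-level audio features have already been extracted, that the common space is obtained with a least-squares linear map from text embeddings to audio features, and that candidates are ranked by cosine similarity. All of these choices, along with the function names, are assumptions made for illustration; the paper's actual mapper and similarity measure may differ.

```python
import numpy as np

def fit_linear_mapper(text_emb, audio_feat):
    """Fit W so that text_emb @ W approximates audio_feat (least squares).

    text_emb:   (n_pairs, d_text) embeddings of the free-text descriptions.
    audio_feat: (n_pairs, d_audio) mid-level features of the paired performances.
    The mapping direction (text -> audio) is an assumption of this sketch.
    """
    W, *_ = np.linalg.lstsq(text_emb, audio_feat, rcond=None)
    return W

def rank_performances(query_emb, W, candidate_feats):
    """Rank candidate performances for one text query by cosine similarity.

    Returns (order, scores): candidate indices from best to worst, and the
    cosine similarity of each candidate to the projected query.
    """
    projected = query_emb @ W
    eps = 1e-12  # guards against division by zero for all-zero vectors
    p = projected / (np.linalg.norm(projected) + eps)
    c = candidate_feats / (
        np.linalg.norm(candidate_feats, axis=1, keepdims=True) + eps
    )
    scores = c @ p
    return np.argsort(-scores), scores
```

Because each audio dimension in such a setup would correspond to a named perceptual quality, the per-dimension products between the projected query and a retrieved candidate could be inspected to see which qualities drove the match, which is one way the interpretability mentioned above might be exploited.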
