Expressivity-aware Music Performance Retrieval using Mid-level Perceptual Features and Emotion Word Embeddings

Shreyan Chowdhury, Gerhard Widmer

TL;DR

This work tackles expressivity-aware music performance retrieval, where a user describes the desired expressive character of a rendition. It replaces the standard text-audio embedding pipeline with a piano-domain mid-level audio feature encoder and an emotion-enriched text encoder, projecting both into a common space via a linear mapper to enable retrieval. On the Con Espressione dataset, the proposed approach substantially improves retrieval quality, achieving a mean reciprocal rank (MRR) of up to 0.61 and top-1 accuracy approaching 0.38, outperforming a baseline Music-Text Representation model. Importantly, the mid-level feature dimensions offer interpretable explanations of retrieval decisions, supporting explainability in music search and downstream recommendation tasks.

Abstract

This paper explores a specific sub-task of cross-modal music retrieval. We consider the delicate task of retrieving a performance or rendition of a musical piece based on a description of its style, expressive character, or emotion from a set of different performances of the same piece. We observe that a general purpose cross-modal system trained to learn a common text-audio embedding space does not yield optimal results for this task. By introducing two changes -- one each to the text encoder and the audio encoder -- we demonstrate improved performance on a dataset of piano performances and associated free-text descriptions. On the text side, we use emotion-enriched word embeddings (EWE) and on the audio side, we extract mid-level perceptual features instead of generic audio embeddings. Our results highlight the effectiveness of mid-level perceptual features learnt from music and emotion enriched word embeddings learnt from emotion-labelled text in capturing musical expression in a cross-modal setting. Additionally, our interpretable mid-level features provide a route for introducing explainability in the retrieval and downstream recommendation processes.
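To make the retrieval setup concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear map `W` from the text embedding space to the mid-level feature space has already been learned. The candidate performances are ranked by cosine similarity to the mapped query, and the mean reciprocal rank (MRR) and top-1 accuracy are computed from the resulting ranks. All names, dimensions, and numbers are hypothetical toy values chosen for illustration.

```python
import math

def apply_linear_map(W, x):
    """Project a text embedding x into the audio feature space: y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def rank_of_true(query, candidates, true_idx):
    """1-based rank of the true performance, candidates sorted by similarity."""
    sims = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)
    return order.index(true_idx) + 1

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def top1_accuracy(ranks):
    """Fraction of queries whose true performance is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

# Hypothetical toy data: 3-d "text embeddings", 2-d "mid-level features".
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]  # assumed to be already learned; here a fixed projection

# Feature vectors of three performances of the same piece (made up).
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each query is (text embedding, index of the performance it describes).
queries = [([1.0, 0.1, 5.0], 0),
           ([0.2, 1.0, -3.0], 1),
           ([1.0, 0.0, 0.0], 2)]

ranks = [rank_of_true(apply_linear_map(W, q), candidates, t) for q, t in queries]
print("ranks:", ranks)
print("MRR:", mean_reciprocal_rank(ranks))
print("top-1 accuracy:", top1_accuracy(ranks))
```

Because every query is ranked only against renditions of the same piece, the candidate set is small, and the MRR reflects how well the mapped description separates expressive differences rather than differences between pieces.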

Paper Structure (9 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: A system for retrieving the best-matching performance of a musical piece based on a text description of its expressive character.
  • Figure 2: In our system, the audio and text encoders of an MTR model (Doh et al., 2023) are replaced by a mid-level feature model and an emotion-enriched word representation model, respectively.
  • Figure 3: Seven performances of Bach's C Major Prelude, with human textual characterisation and corresponding mid-level feature values predicted by mapping $h(g_{\text{EWE}})$. Text and feature values relate to the performance identified by "true = ..."