Sounding Out Reconstruction Error-Based Evaluation of Generative Models of Expressive Performance

Silvan David Peter, Carlos Eduardo Cancino-Chacón, Emmanouil Karystinaios, Gerhard Widmer

TL;DR

This paper questions the reliability of reconstruction-error-based evaluation (REE) for generative models of expressive piano performance, showing that distance to human references may not align with perceptual similarity and that results vary with reference choice and piece. It proposes a framework that encodes performances as time-series of expressive features and uses a ball-of-experts randomization to generate controlled negatives, then tests perceptual discernment via listening experiments and assesses REE reliability and validity across references and pieces. Key findings include limited listener discernment for some features, substantial variability in reliability across pieces, and mixed validity where REE sometimes favors random performances. The work highlights the need for more nuanced quantitative evaluation methods, such as shorter, more consistent excerpts and distributional or learned discriminative approaches, to better capture perceptual equivalence in expressive performance modeling.

Abstract

Generative models of expressive piano performance are usually assessed by comparing their predictions to a reference human performance. A generative algorithm is taken to be better than competing ones if it produces performances that are closer to a human reference performance. However, expert human performers can (and do) interpret music in different ways, making for different possible references, and quantitative closeness is not necessarily aligned with perceptual similarity, raising concerns about the validity of this evaluation approach. In this work, we present a number of experiments that shed light on this problem. Using precisely measured high-quality performances of classical piano music, we carry out a listening test indicating that listeners can sometimes perceive subtle performance differences that go unnoticed under quantitative evaluation. We further present tests indicating that such evaluation frameworks vary considerably in reliability and validity across different reference performances and pieces. We discuss these results and their implications for quantitative evaluation, and hope to foster a critical appreciation of the uncertainties involved in quantitative assessments of such performances within the wider music information retrieval (MIR) community.

Paper Structure (20 sections, 2 equations, 3 figures, 2 tables)

This paper contains 20 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Schematic representation of our framework for two-model evaluation. Such frameworks are commonly used to compare two or more candidate models of expressive performance. In our experiments, however, the models are specifically designed so that the ground truth with respect to evaluation is known (in the sense discussed in Section \ref{sec:evaluation}): Model 1 produces only expert performances (purple) and Model 2 only randomly sampled performances (orange), i.e., Model 1 is the musically valid one. Each model produces one performance ($P_1$ and $P_2$). The MSE of each performance with respect to an expert reference performance ($RP$) is measured ($E_1$ and $E_2$, row 3). The comparison of the error terms (row 4) outputs a Boolean decision value (red).
  • Figure 2: Top: excerpt of Mozart's Piano Sonata K 331 Mv. 1 (Bars 5--8). Middle: non-standardized tempo curves. Bottom: mean-log standardized tempo curves (see Section \ref{sec:tce}). Colored lines represent tempo curves from the Vienna 4x22 dataset for the Mozart excerpt; black curves represent averaged tempo curves; red lines are randomly generated (non-musical) performances. The gray shaded area indicates one standard deviation above and below the average curve.
  • Figure 3: Illustration of the sampling process approximating the ball of expert performances with a mixture of three Gaussian random variables. The average performance (opaque blue, top) is computed from expert performances (translucent blue, top) and segmented into quantiles (red boxes). A randomized performance (orange, bottom) is then sampled from a Gaussian distribution for each quantile, with the standard deviation controlled by a noise-level parameter.
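
The two-model comparison in Figure 1 and the sampling process in Figure 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions: the function names (`mse`, `sample_random_performance`, `prefers_expert`) are hypothetical, the quantile segments are formed by splitting the sorted values of the average curve into equal-sized groups, each time step draws from a uniformly chosen mixture component, and the noise-level parameter is used directly as the standard deviation. The demo data are synthetic stand-ins, not curves from the Vienna 4x22 dataset.

```python
import numpy as np


def mse(performance, reference):
    """Mean squared error between two equal-length feature curves."""
    performance = np.asarray(performance, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean((performance - reference) ** 2))


def sample_random_performance(experts, noise_level=1.0, n_components=3, rng=None):
    """Sample a randomized curve from a Gaussian mixture around the average.

    Assumptions (not confirmed by the source): the sorted values of the
    average curve are split into `n_components` equal-sized quantile groups,
    each group contributes one Gaussian centred on the group mean, and every
    time step draws from a uniformly chosen component whose standard
    deviation equals `noise_level`.
    """
    rng = np.random.default_rng() if rng is None else rng
    experts = np.asarray(experts, dtype=float)  # shape: (n_performers, n_steps)
    average = experts.mean(axis=0)
    groups = np.array_split(np.sort(average), n_components)
    means = np.array([group.mean() for group in groups])
    # Pick one mixture component per time step, then sample around its mean.
    components = rng.integers(0, n_components, size=average.shape[0])
    return rng.normal(loc=means[components], scale=noise_level)


def prefers_expert(expert_perf, random_perf, reference):
    """Boolean decision: is the expert's error E_1 smaller than the random E_2?"""
    return mse(expert_perf, reference) < mse(random_perf, reference)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    steps = np.linspace(0.0, 2.0 * np.pi, 64)
    # Synthetic stand-ins for standardized tempo curves of five performers.
    experts = np.sin(steps) + 0.1 * rng.standard_normal((5, steps.size))
    reference, candidate = experts[0], experts[1]
    random_perf = sample_random_performance(experts[1:], noise_level=0.5, rng=rng)
    print(prefers_expert(candidate, random_perf, reference))
```

Repeating the decision over many sampled random performances and over every choice of reference gives a rough estimate of how often the error-based comparison favors the expert, which is the kind of validity question the figures describe.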