Evaluating High-Resolution Piano Sustain Pedal Depth Estimation with Musically Informed Metrics

Hanwen Zhang, Kun Fang, Ziyu Wang, Ichiro Fujinaga

TL;DR

The paper tackles the inadequacy of frame-level metrics in evaluating continuous sustain pedal depth estimation by introducing a musically informed framework with action-based and gesture-based analyses. It develops and compares three Transformer-based baselines (Binary, Audio, Audio+MIDI) and defines outputs $x_{1:T} \in [0,1]$, $o_{1:T}$, $f_{1:T}$, $g \in [0,1]$ under a fixed-weight multi-task loss $\mathcal{L}_{total}$. Experiments on MAESTRO demonstrate that continuous-valued predictions yield clearer action boundaries and gesture contours, with MIDI-informed models achieving the best results in action- and gesture-level metrics, even when frame-level gains are modest. The study emphasizes the practical value of musically informed evaluation for pedal depth tasks and suggests future work on boundary-sensitive objectives, contour-aware losses, and perceptual validation.

Abstract

Evaluation for continuous piano pedal depth estimation tasks remains incomplete when relying only on conventional frame-level metrics, which overlook musically important features such as direction-change boundaries and pedal curve contours. To provide more interpretable and musically meaningful insights, we propose an evaluation framework that augments standard frame-level metrics with an action-level assessment measuring direction and timing using segments of press/hold/release states and a gesture-level analysis that evaluates contour similarity of each press-release cycle. We apply this framework to compare an audio-only baseline with two variants: one incorporating symbolic information from MIDI, and another trained in a binary-valued setting, all within a unified architecture. Results show that the MIDI-informed model significantly outperforms the others at action and gesture levels, despite modest frame-level gains. These findings demonstrate that our framework captures musically relevant improvements indiscernible by traditional metrics, offering a more practical and effective approach to evaluating pedal depth estimation models.

Paper Structure

This paper contains 14 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An illustration of the three-level analysis proposed and used in this work. (a) Frame: each time step is treated as the basic unit. (b) Action: consecutive frames sharing the same direction/state are merged into segments labeled press (yellow), hold (pink), or release (blue). (c) Gesture: complete pedal press-release cycles, characterized and classified according to canonical contour shapes.
  • Figure 2: Two examples with large MSE and MAE but correctly following the pattern. The top one has MSE 0.0983 and MAE 0.2425. The bottom one has MSE 0.0866 and MAE 0.2053.
  • Figure 3: Overall distribution of pedal actions (top) and gestures (bottom) across models. Each row corresponds to one model and the ground truth; bars are stacked by category, with percentages computed from summed frame counts in each category. Labels show per-category proportions.