Evaluating High-Resolution Piano Sustain Pedal Depth Estimation with Musically Informed Metrics
Hanwen Zhang, Kun Fang, Ziyu Wang, Ichiro Fujinaga
TL;DR
The paper addresses the inadequacy of frame-level metrics for evaluating continuous sustain pedal depth estimation by introducing a musically informed framework with action-based and gesture-based analyses. It develops and compares three Transformer-based baselines (Binary, Audio, Audio+MIDI) whose outputs $x_{1:T} \in [0,1]$, $o_{1:T}$, $f_{1:T}$, and $g \in [0,1]$ are trained under a fixed-weight multi-task loss $\mathcal{L}_{total}$. Experiments on MAESTRO show that continuous-valued predictions yield clearer action boundaries and gesture contours, and that the MIDI-informed model achieves the best action- and gesture-level results even when its frame-level gains are modest. The study emphasizes the practical value of musically informed evaluation for pedal depth tasks and suggests future work on boundary-sensitive objectives, contour-aware losses, and perceptual validation.
Abstract
Evaluation of continuous piano pedal depth estimation remains incomplete when it relies only on conventional frame-level metrics, which overlook musically important features such as direction-change boundaries and pedal curve contours. To provide more interpretable and musically meaningful insights, we propose an evaluation framework that augments standard frame-level metrics with two further levels: an action-level assessment, which measures direction and timing using segments of press/hold/release states, and a gesture-level analysis, which evaluates the contour similarity of each press-release cycle. We apply this framework to compare an audio-only baseline with two variants, one incorporating symbolic information from MIDI and another trained in a binary-valued setting, all within a unified architecture. Results show that the MIDI-informed model significantly outperforms the others at the action and gesture levels, despite modest frame-level gains. These findings demonstrate that our framework captures musically relevant improvements that traditional metrics cannot discern, offering a more practical and effective approach to evaluating pedal depth estimation models.
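To make the action-level idea concrete, the following is a minimal sketch of how a pedal depth curve could be split into press/hold/release segments. The abstract does not state how segments are derived, so the dead-band threshold `eps`, the function names `label_actions` and `to_segments`, and the toy curve are all assumptions made for illustration, not the authors' procedure.

```python
from itertools import groupby


def label_actions(depth, eps=0.01):
    """Label each frame-to-frame step as 'press', 'hold', or 'release'.

    `eps` is a hypothetical dead-band: depth changes smaller than it
    are treated as holds. The paper may define this differently.
    """
    labels = []
    for prev, cur in zip(depth, depth[1:]):
        delta = cur - prev
        if delta > eps:
            labels.append("press")
        elif delta < -eps:
            labels.append("release")
        else:
            labels.append("hold")
    return labels


def to_segments(labels):
    """Collapse runs of identical labels into (state, start, end) tuples.

    Indices refer to steps between frames; `end` is exclusive.
    """
    segments = []
    start = 0
    for state, run in groupby(labels):
        length = len(list(run))
        segments.append((state, start, start + length))
        start += length
    return segments


# Toy pedal depth curve in [0, 1]: one press-release cycle.
curve = [0.0, 0.3, 0.6, 0.6, 0.6, 0.4, 0.1, 0.0]
segments = to_segments(label_actions(curve))
print(segments)
```

Under these assumptions, an action-level comparison would match predicted segments against reference segments by state and boundary timing, and each press-to-release span would form one gesture for contour comparison.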
