End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding

Wei Zeng, Xian He, Ye Wang

TL;DR

This work tackles end-to-end piano audio-to-score transcription (A2S) with a Seq2Seq model whose hierarchical decoder produces bar-level metadata (key and time signatures) and note-level sequences for both staves. To close the gap between synthetic data and real human performance, it uses a two-stage training regime: pre-training on synthetic audio rendered by an expressive performance rendering (EPR) system, then fine-tuning on ASAP piano recordings. It also uses a Kern-based score representation with a pre-processing step that preserves voicing. The key contributions are the hierarchical multi-task architecture, the two-stage training pipeline, and the voicing-preserving Kern serialization method, evaluated on both synthetic and real-world data and reported to improve MV2H and WER over baselines. The approach moves A2S closer to real-world use, enabling more accurate transcription of polyphonic piano music and supporting downstream music analysis and practice tools.

Abstract

Piano audio-to-score transcription (A2S) is an important yet underexplored task with extensive applications for music composition, practice, and analysis. However, existing end-to-end piano A2S systems have faced difficulties in retrieving bar-level information such as key and time signatures, and have been trained and evaluated with only synthetic data. To address these limitations, we propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores, enabling the transcription of score information at both the bar and note levels through multi-task learning. To bridge the gap between synthetic data and recordings of human performance, we propose a two-stage training scheme, which involves pre-training the model on synthetic audio rendered by an expressive performance rendering (EPR) system, followed by fine-tuning the model on recordings of human performance. To preserve the voicing structure for score reconstruction, we propose a pre-processing method for **Kern scores in scenarios with an unconstrained number of voices. Experimental results support the effectiveness of our proposed approaches, both in transcription performance on synthetic audio data compared with the current state of the art, and in the first experiment on human recordings.

Paper Structure (27 sections, 4 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: The proposed piano A2S model with a hierarchical decoder to transcribe the audio into both bar-level and note-level information. The model first encodes a Variable-Q Transform spectrogram $X$ into audio context representation $c$. Subsequently, a bar-level Decoder decodes $c$ into bar-level representation $b_i$ for the $i$-th bar. Time signatures ($\hat{\boldsymbol{y}}^t$) and keys ($\hat{\boldsymbol{y}}^k$) are transcribed at this bar-level given bar-level representations $\boldsymbol{b}$, while two note-level Decoders further decode $\boldsymbol{b}$ into note sequences of the lower staff ($\hat{\boldsymbol{y}}^n_{lower}$) and upper staff ($\hat{\boldsymbol{y}}^n_{upper}$), respectively.
  • Figure 2: A sample excerpt with multiple voices: on the left, the **Kern representation; in the center, the original score; on the right, the voicing structure of the excerpt. After pre-processing, the lower staff is serialized as: $\{4, EE-, \langle b \rangle, 4, E-, \backslash n, 4, r ...\}$, and the upper staff is serialized as: $\{16, cc, \backslash t, 4, g, \backslash n, 8, b-, ...\}$, where $\langle b \rangle$ is the token for a blank space. The identifiers are manually added as post-processing.
  • Figure 3: The two-stage training scheme including pre-training on synthetic data from an EPR system, and fine-tuning on human recordings.
  • Figure 4: The confusion matrices of key (upper) and time signature (lower) of Ours (EPR) in the pre-training stage.
  • Figure 5: A sample score (upper) and its transcription result (lower) from the fine-tuned Ours (EPR) model. The excerpt is selected from Prelude and Fugue in D minor, BWV 875 (Bach, Johann Sebastian), performed by HONG04M.
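
The serialization shown in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' pre-processing code: the function name `serialize_staff`, the regular expression, and the handling of dotted durations are assumptions made for illustration. The only behavior taken from the caption is that each note splits into a duration token and a pitch token, that notes separated by a space are joined by the `<b>` token, and that the tab and newline separators are kept as tokens.

```python
import re

# Assumed note shape: leading duration digits (optionally dotted), then the rest
# (pitch letters, accidentals, or "r" for a rest). Hypothetical, not from the paper.
_NOTE = re.compile(r"(\d+\.*)(.*)")


def serialize_staff(kern_text):
    """Flatten one staff of Kern-like text into a token list.

    Newlines separate time steps, tabs separate voices, and spaces separate
    notes sounding together; each separator becomes its own token.
    """
    tokens = []
    for line_idx, line in enumerate(kern_text.split("\n")):
        if line_idx > 0:
            tokens.append("\n")
        for voice_idx, voice in enumerate(line.split("\t")):
            if voice_idx > 0:
                tokens.append("\t")
            for note_idx, note in enumerate(voice.split(" ")):
                if note_idx > 0:
                    tokens.append("<b>")  # blank-space token from Figure 2
                match = _NOTE.fullmatch(note)
                if match is None:
                    tokens.append(note)  # keep unrecognized tokens unchanged
                    continue
                tokens.append(match.group(1))  # duration, e.g. "4" or "16"
                if match.group(2):
                    tokens.append(match.group(2))  # pitch or rest, e.g. "EE-"
    return tokens


print(serialize_staff("16cc\t4g\n8b-"))
# ['16', 'cc', '\t', '4', 'g', '\n', '8', 'b-']
```

Applied to the lower-staff fragment `"4EE- 4E-\n4r"`, the same function yields the token sequence given in the caption: `4, EE-, <b>, 4, E-, \n, 4, r`.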