
Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li

TL;DR

Pianist Transformer tackles the data-scarcity problem in expressive piano rendering by applying large-scale self-supervised pre-training over a unified MIDI representation, which lets the model learn from unaligned MIDI data. The model is an efficient asymmetric Transformer with encoder sequence compression, enabling long-context modeling at scale and fast inference. A two-stage training pipeline (self-supervised pre-training on 10B tokens, followed by supervised fine-tuning on score-performance pairs) yields state-of-the-art objective metrics and human-level perceptual quality, and Expressive Tempo Mapping makes the outputs editable in DAWs. Empirical results show substantial gains from pre-training, robustness across styles, and near-human subjective ratings, supporting a scalable path toward human-like music performance synthesis. The work also identifies decoder bottlenecks and outlines directions for multi-instrument and language-controllable generation.

Abstract

Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets. This limits scaling of both data volume and model size, even though vast amounts of unlabeled music are available, much as unlabeled data is in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and a 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model, which achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
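
To make the efficiency claim concrete, the sketch below estimates how shortening the encoder input reduces self-attention cost. It relies only on the general fact that full self-attention scores every pair of positions, so cost grows with the square of sequence length. The compression factor of 8 and the 4096-token input are hypothetical values chosen for illustration; this summary does not state the ratio or the mechanism the authors use.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return seq_len * seq_len


def compressed_length(seq_len: int, factor: int) -> int:
    """Length after merging every `factor` tokens into one position.

    `factor` is a hypothetical compression ratio, not a value from the paper.
    Ceiling division keeps a final, partially filled group.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return -(-seq_len // factor)


def attention_savings(seq_len: int, factor: int) -> float:
    """Ratio of uncompressed to compressed attention pairs."""
    short = compressed_length(seq_len, factor)
    return attention_pairs(seq_len) / attention_pairs(short)


if __name__ == "__main__":
    n, k = 4096, 8  # illustrative values only
    print(compressed_length(n, k))
    print(attention_savings(n, k))
```

Under these assumptions, a compression factor of k cuts the encoder's attention pairs by roughly k squared, which is why compressing the encoder sequence is a plausible route to longer score contexts.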

Paper Structure

This paper contains 52 sections, 1 equation, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: The Paradigm Shift in Expressive Piano Performance Rendering. (Left) Previous Supervised Paradigm: Existing systems operate under a strictly supervised pipeline that depends on scarce aligned datasets ($\approx 100$ hours) and cannot exploit the vast in-the-wild MIDI corpus ($>100K$ hours). This reliance on explicit structural features fundamentally limits scalability. (Right) Our Scalable Self-Supervised Paradigm: Pianist Transformer shifts the paradigm by making large-scale self-supervised learning feasible for expressive piano performance rendering. Through the unified MIDI representation, the model can pre-train on over 100K hours of unaligned MIDI to acquire rich musical priors, and then generalize effectively through supervised fine-tuning.
  • Figure 2: The overall architecture and workflow of Pianist Transformer. Our framework processes all MIDI data through a Unified Tokenizer, enabling a two-stage training process. The core model is an asymmetric Transformer with Encoder Sequence Compression for efficient processing of long musical scores. The workflow consists of three stages: (1) Pre-train: The model learns foundational musical context from a massive unlabeled corpus via a masked denoising objective, where it takes a masked token sequence as input and predicts the original sequence. (2) SFT: Supervised Fine-Tuning adapts the model to map musical context to expressive nuances using aligned score-performance pairs, where it takes the score tokens as input and predicts the corresponding performance tokens. (3) Inference: The model takes a score as input and generates a performance, which is then made editable for DAWs by our Expressive Tempo Mapping algorithm.
  • Figure 3: The profound impact of large-scale self-supervised pre-training. We compare our Pianist Transformer against an identical model trained from scratch (w/o PT). (JS divergence and intersection panels) Pre-training leads to dramatic improvements in objective metrics that measure distributional similarity to human performances. (Loss panel) This is rooted in a much better learning foundation, as the pre-trained model converges faster and to a significantly lower loss during fine-tuning.
  • Figure 4: Subjective Preference Ranking Results. The evaluation includes pieces by Haydn (P1), Beethoven (P2), Chopin (P3), and Bach (P4). (Average rank panel) The average rank of our Pianist Transformer is statistically indistinguishable from the Human performance and significantly better than all baselines. (First-place votes panel) Our model achieves a slightly higher first-place vote rate than the human pianist, demonstrating strong listener appeal.
  • Figure 5: Multi-dimensional Subjective Ratings (Normalized). A radar chart visualizing the average scores on a 5-point scale for four expressive dimensions. Pianist Transformer exhibits a profile that closely mirrors the Human performance, indicating a well-balanced and high-quality rendering across all aspects. The area covered by our model is substantially larger than that of any baseline.
  • ...and 6 more figures
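
The pre-training stage described in the Figure 2 caption (a masked token sequence goes in, the original sequence is the prediction target) can be sketched as a simple data-corruption step. This is a minimal illustration under stated assumptions, not the authors' implementation: the `<mask>` token name, the 15% masking probability, the independent per-token masking, and the example token strings are all hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"  # hypothetical mask token name


def make_denoising_pair(
    tokens: List[str],
    mask_prob: float = 0.15,  # assumed rate, not taken from the paper
    seed: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Build one (input, target) pair for masked denoising.

    Each token is independently replaced by MASK with probability
    `mask_prob`. The target is the full original sequence, matching the
    caption's "predicts the original sequence".
    """
    rng = random.Random(seed)
    corrupted = [MASK if rng.random() < mask_prob else tok for tok in tokens]
    return corrupted, list(tokens)


if __name__ == "__main__":
    # Hypothetical token strings; the real vocabulary is not given here.
    example = ["pitch_60", "vel_64", "dur_480", "pitch_64", "vel_70", "dur_240"]
    model_input, target = make_denoising_pair(example, mask_prob=0.5, seed=0)
    print(model_input)
    print(target)
```

Because the corruption needs no score-performance alignment, any MIDI file that can be tokenized can supply training pairs, which is what allows pre-training on the large unaligned corpus.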