RenderBox: Expressive Performance Rendering with Text Control

Huan Zhang, Akira Maezawa, Simon Dixon

TL;DR

RenderBox proposes a text-and-score conditioned expressive performance framework that extends a diffusion-transformer backbone with MIDI token conditioning to generate natural, controllable audio across multiple instruments. It introduces a curriculum-based training schedule spanning five stages from synthesis to style-directed performance, enabling controlled variance in speed, mistakes, and stylistic direction. Empirical results show improvements in objective audio metrics (FAD, CLAP), pitch and tempo accuracy, and subjective engagement, supported by an interpretable Performer-Piece embedding space. The work enables flexible, multi-instrument expressive rendering with practical applications in education and creative systems, while noting timbre-focused data limitations and avenues for instrument transfer and richer MIDI cues in future work.

Abstract

Expressive music performance rendering involves interpreting symbolic scores with variations in timing, dynamics, articulation, and instrument-specific techniques, resulting in performances that capture musical and emotional intent. We introduce RenderBox, a unified framework for text-and-score controlled audio performance generation across multiple instruments, applying coarse-level controls through natural language descriptions and granular-level controls using music scores. Based on a diffusion transformer architecture and cross-attention joint conditioning, we propose a curriculum-based paradigm that trains from plain synthesis to expressive performance, gradually incorporating controllable factors such as speed, mistakes, and style diversity. RenderBox achieves high performance compared to baseline models across key metrics such as FAD and CLAP, as well as tempo and pitch accuracy under different prompting tasks. Subjective evaluation further demonstrates that RenderBox is able to generate controllable expressive performances that sound natural and musically engaging, aligning well with prompts and intent.

Paper Structure

This paper contains 19 sections, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: An overview of the performance space proposed by our paradigm, progressively from strict to variant relative to the input MIDI score.
  • Figure 2: ControlNet conditioning (left) and concatenative cross-attention conditioning (right), with color highlighting the initialization of modules and their optimization in our experiments.
  • Figure 3: MOS from the subjective evaluation on the four dimensions, separated by participants' experience.
  • Figure 4: Input MIDI piano rolls and output spectrograms under different text prompts. All visualizations are 20-second windows.
  • Figure 5: t-SNE visualization of generations from a test-data subset, colored by performer, with marker shape indicating composer.