Segment-Factorized Full-Song Generation on Symbolic Piano Music

Ping-Yi Chen, Chih-Pin Tan, Yi-Hsuan Yang

TL;DR

The paper addresses full-song symbolic piano generation, where a model must balance long-range structural coherence with local motif development. It introduces the Segmented Full-Song Model (SFS), which decomposes a song into segments and generates each one with a Transformer-based generator conditioned on four context sources (Left, Right, Seed, Ref) plus a global summary encoder $G$, together with frame-based tokenization and specialized positional encodings. The key contributions are a factorized joint probability framework, selective attention to context segments, reported improvements in seed adherence and structural coherence, and real-time generation that enables interactive human–AI composition. Code, weights, and a web interface are released as open source. The work is relevant to interactive music creation and scalable full-song generation, and points toward more natural human–AI collaboration in symbolic music.

Abstract

We propose the Segmented Full-Song Model (SFS) for symbolic full-song generation. The model accepts a user-provided song structure and an optional short seed segment that anchors the main idea around which the song is developed. By factorizing a song into segments and generating each one through selective attention to related segments, the model achieves higher quality and efficiency compared to prior work. To demonstrate its suitability for human-AI interaction, we further wrap SFS into a web application that enables users to iteratively co-create music on a piano roll with customizable structures and flexible ordering.
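
One way to read "factorizing a song into segments" is as a chain-rule factorization over segments in an arbitrary generation order. The following is a sketch under assumed notation, not the paper's own equation: $x_i$ denotes the content of segment $i$, $N$ the number of segments, $o_t$ the index of the segment generated at step $t$, $c$ the user-provided structure, and $\mathcal{C}_t$ a hypothetical subset of previously generated segments selected as context.

$$
p(x_1,\dots,x_N \mid c)
= \prod_{t=1}^{N} p\bigl(x_{o_t} \mid x_{o_1},\dots,x_{o_{t-1}},\, c\bigr)
\approx \prod_{t=1}^{N} p\bigl(x_{o_t} \mid \{x_j : j \in \mathcal{C}_t\},\, c\bigr),
\qquad \mathcal{C}_t \subseteq \{o_1,\dots,o_{t-1}\}.
$$

The equality is the exact chain rule and holds for any permutation $o$. The approximation is where selective attention to related segments would enter: each segment conditions on a few selected segments rather than on everything generated so far.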

Paper Structure

This paper contains 16 sections, 6 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: (a) Generative process with $(\hat{s}_1,\hat{e}_1)=(1,2)$, $(\hat{s}_2,\hat{e}_2)=(3,4)$, $(\hat{s}_3,\hat{e}_3)=(5,5)$, $(\hat{s}_4,\hat{e}_4)=(6,7)$, $\hat{l}_{1:4}=(A,B,A,B)$, and $o_{1:4}=(2,1,4,3)$. See the section "Segment-Factorized Full-Song Generation" for notation. (b) Example of our music language: orange refers to frame tokens, blue refers to note tokens, and gray refers to inferred positions. The [Duration 0] token indicates that a note’s offset is set by the next onset of the same pitch or the next bar line.
  • Figure 2: Model architecture of the Segmented Full-Song Model. At the output heads of the Generator (bottom middle), pitch, velocity, and duration are generated sequentially due to their dependencies. During training, the velocity classifier receives the ground-truth pitch, and the duration classifier receives the ground-truth pitch and velocity. During inference, they instead receive sampled values.
  • Figure 3: Segmentation result for the song Hikaru Nara - Your Lie in April OP1 [Piano](https://www.youtube.com/watch?v=zsVAbS8xmaU) using our algorithm with different settings of $\alpha$ and $k$. The setting actually used for this song is $\alpha=0.7$ and $k=6$. The heatmap shows the similarity matrix of the song, with purple to yellow indicating values from 0 to 1.
  • Figure 4: Start–End positional encoding
  • Figure 5: Sub-beat positional encoding. Purple indicates 0, and yellow indicates 1.
  • ...and 1 more figure
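
The setup in the Figure 1(a) caption can be traced with a small Python sketch. It only does bookkeeping: it lists, for each generation step, which neighbouring and same-label segments already exist. Several things here are assumptions rather than the paper's definitions: $o_t$ is read as the index of the segment generated at step $t$ (for this particular permutation the inverse reading gives the same order), $(\hat{s}_i,\hat{e}_i)$ is treated as an inclusive start and end position, and the `left`, `right`, and `same_label` fields are hypothetical stand-ins for the kind of context a segment might attend to, not the actual Left, Right, Seed, and Ref rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    index: int  # 1-based segment index i
    start: int  # s_i from the caption
    end: int    # e_i from the caption (assumed inclusive)
    label: str  # l_i, e.g. "A" or "B"


def build_segments(bounds, labels):
    """Pair each (start, end) with its label and a 1-based index."""
    return [
        Segment(i + 1, s, e, lab)
        for i, ((s, e), lab) in enumerate(zip(bounds, labels))
    ]


def generation_plan(segments, order):
    """For each step, record which candidate context segments already exist.

    The context fields are illustrative assumptions, not the paper's rules.
    """
    by_index = {seg.index: seg for seg in segments}
    done = set()
    plan = []
    for step, idx in enumerate(order, start=1):
        seg = by_index[idx]
        plan.append({
            "step": step,
            "segment": idx,
            "label": seg.label,
            # Adjacent segments count only if they were generated earlier.
            "left": idx - 1 if (idx - 1) in done else None,
            "right": idx + 1 if (idx + 1) in done else None,
            # Earlier segments that share this segment's label.
            "same_label": sorted(
                j for j in done if by_index[j].label == seg.label
            ),
        })
        done.add(idx)
    return plan


# Values taken from the Figure 1(a) caption.
bounds = [(1, 2), (3, 4), (5, 5), (6, 7)]
labels = ["A", "B", "A", "B"]
order = [2, 1, 4, 3]

segments = build_segments(bounds, labels)
plan = generation_plan(segments, order)
for row in plan:
    print(row)
```

Under these assumptions, the first generated segment (segment 2) has no context at all, while the last one (segment 3) has both neighbours and an earlier segment with the same label available. This illustrates why a flexible generation order changes what each segment can condition on.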