Structure-informed Positional Encoding for Music Generation

Manvi Agarwal; Changhong Wang; Gaël Richard

TL;DR

The paper presents StructurePE, a framework that injects hierarchical musical structure into Transformer positional encoding through three variants: StructureAPE (S-APE), StructureRPE (S-RPE), and Nonstationary StructureRPE (NS-RPE). It evaluates these methods on two symbolic music generation tasks (next-timestep prediction and accompaniment generation) using the POP909 dataset, with extensive baselines including NoPE, APE, RPE, and other structure-aware approaches. Results show that structure-informed encodings, particularly S-APE and NS-RPE, improve structural metrics such as self-similarity and chroma consistency, and enhance accompaniment quality, while NoPE can perform competitively and should be considered a baseline in future work. The work provides code, refined dataset alignments, and audio examples, contributing to more coherent and musically faithful generated compositions.

Abstract

Music generated by deep learning methods often suffers from a lack of coherence and long-term organization. Yet, multi-scale hierarchical structure is a distinctive feature of music signals. To leverage this information, we propose a structure-informed positional encoding framework for music generation with Transformers. We design three variants in terms of absolute, relative and non-stationary positional information. We comprehensively test them on two symbolic music generation tasks: next-timestep prediction and accompaniment generation. As a comparison, we choose multiple baselines from the literature and demonstrate the merits of our methods using several musically-motivated evaluation metrics. In particular, our methods improve the melodic and structural consistency of the generated pieces.

Paper Structure (13 sections, 4 equations, 2 figures, 1 table)

Introduction
Methods
Input Representation
Positional Encoding
Structure-informed Positional Encoding
Experiments
Task setup
Model and Dataset
Baselines
Post-processing: Binarization and Velocity-encoding
Evaluation metrics
Results and Discussion
Conclusion

Figures (2)

Figure 1: Illustrative schematic of PEs, covering both the baselines (NoPE, APE and RPE) and ours (the rest), and their use in Transformers. See the sections "Positional Encoding" and "Structure-informed Positional Encoding" for details.
Figure 2: Comparison of self-similarity matrices from music generated by baselines (columns 2 and 3) and our methods (columns 4 and 5). Best viewed in colour.
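
For readers unfamiliar with self-similarity matrices, the following is a minimal sketch of one common way to compute them from a piano-roll array, using cosine similarity between timesteps. This page does not state the exact features or similarity measure used in the paper, so the function name `self_similarity`, the `(timesteps, pitches)` layout, and the toy A-B-A input are all assumptions made for illustration.

```python
import numpy as np


def self_similarity(pianoroll: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine self-similarity between the timesteps of a (T, P) piano roll.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    x = pianoroll.astype(np.float64)
    # L2 norm of each timestep; eps guards against division by zero on silent frames.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, eps)
    # Entry (i, j) is the cosine similarity between timesteps i and j.
    return unit @ unit.T


# Toy A-B-A piece: section A holds pitch class 0, section B holds pitch class 7.
section_a = np.zeros((4, 12))
section_a[:, 0] = 1.0
section_b = np.zeros((4, 12))
section_b[:, 7] = 1.0
roll = np.concatenate([section_a, section_b, section_a])

ssm = self_similarity(roll)
```

In this toy case the repeated A section appears as bright off-diagonal blocks in `ssm`, which is the kind of pattern a structurally coherent generated piece would be expected to reproduce.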
