Structure-informed Positional Encoding for Music Generation
Manvi Agarwal, Changhong Wang, Gaël Richard
TL;DR
The paper presents StructurePE, a framework that injects hierarchical musical structure into Transformer positional encoding through three variants: StructureAPE (S-APE), StructureRPE (S-RPE), and Nonstationary StructureRPE (NS-RPE). The variants are evaluated on two symbolic music generation tasks, next-timestep prediction and accompaniment generation, using the POP909 dataset, and are compared against extensive baselines including NoPE (no positional encoding), APE, RPE, and other structure-aware approaches. Results show that structure-informed encodings, particularly S-APE and NS-RPE, improve structural metrics such as self-similarity and chroma consistency and enhance accompaniment quality, yielding more coherent generated compositions. NoPE can also perform competitively and should be considered a baseline in future work. The work additionally provides code, refined dataset alignments, and audio examples.
Abstract
Music generated by deep learning methods often suffers from a lack of coherence and long-term organization. Yet multi-scale hierarchical structure is a distinctive feature of music signals. To leverage this information, we propose a structure-informed positional encoding framework for music generation with Transformers. We design three variants based on absolute, relative, and non-stationary positional information. We comprehensively test them on two symbolic music generation tasks: next-timestep prediction and accompaniment generation. For comparison, we choose multiple baselines from the literature and demonstrate the merits of our methods using several musically motivated evaluation metrics. In particular, our methods improve the melodic and structural consistency of the generated pieces.
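To make the idea of a structure-informed absolute positional encoding concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes that each timestep carries two hypothetical structural labels (a section index and a bar index) and that each label is passed through a standard sinusoidal encoding, with the two halves concatenated. The function names, the label format, and the concatenation scheme are all illustrative assumptions.

```python
import math


def sinusoidal(position, dim, base=10000.0):
    """Standard sinusoidal encoding of one integer position into `dim` values."""
    vec = []
    for i in range(0, dim, 2):
        angle = position / (base ** (i / dim))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:dim]


def structure_informed_ape(labels, dim):
    """Encode each timestep from its structural labels instead of its raw index.

    `labels` is a list of (section_index, bar_index) pairs, one per timestep.
    This label format is a hypothetical choice made for illustration only.
    """
    half = dim // 2
    return [
        sinusoidal(section, half) + sinusoidal(bar, dim - half)
        for section, bar in labels
    ]


# Five timesteps: the first two share a section and a bar, the last two as well.
labels = [(0, 0), (0, 0), (0, 1), (1, 2), (1, 2)]
encodings = structure_informed_ape(labels, dim=8)
```

In this sketch, timesteps that occupy the same place in the musical hierarchy receive identical encodings, whereas an index-based encoding would separate them. That property is the intuition being illustrated; how structure is actually combined with position in the three proposed variants is specified in the paper itself.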
