F-StrIPE: Fast Structure-Informed Positional Encoding for Symbolic Music Generation
Manvi Agarwal, Changhong Wang, Gael Richard
TL;DR
This work tackles long-range coherence in symbolic music generation by introducing F-StrIPE, a fast structure-informed positional encoding. It runs with linear complexity in sequence length by approximating a kernel with Random Fourier Features (RFF). The method replaces standard time indices with multi-resolution structural labels and unifies structure-aware sinusoidal features with RFF. In doing so, it generalizes Stochastic Positional Encoding (SPE) to incorporate musical priors without incurring the quadratic cost of standard attention. Experiments on melody harmonization show that structure-informed PEs, especially those using chord-level labels, improve the musicality metrics CS, GS and NDD over the NoPE (no positional encoding) and SPE baselines while keeping linear-time efficiency. The findings suggest that combining domain priors with kernel-based attention is a practical route to scalable, high-quality symbolic music generation. They also point to future work on richer structures and mixed-approximation strategies.
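The central idea can be sketched as follows. A structural label stands in for the time index, and a kernel over labels is approximated with sinusoidal Random Fourier Features. This is a minimal illustration, not the authors' implementation. It rests on several assumptions that do not come from the paper: a Gaussian kernel over scalar labels, a bandwidth `sigma`, and a hypothetical list of chord-level labels `chord_labels`. The helper names (`make_rff_map`, `gaussian_kernel`) are likewise illustrative.

```python
import math
import random


def make_rff_map(num_features: int, sigma: float, seed: int = 0):
    """Return phi(label) such that phi(a) . phi(b) ~= exp(-(a - b)**2 / (2 * sigma**2)).

    Assumption: a Gaussian kernel over labels. By Bochner's theorem it is
    approximated by sampling frequencies omega ~ N(0, 1 / sigma**2)
    and phases b ~ U[0, 2*pi) (Random Fourier Features).
    """
    rng = random.Random(seed)
    omegas = [rng.gauss(0.0, 1.0 / sigma) for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def phi(label: float) -> list[float]:
        # The structural label takes the place of the usual time index.
        return [scale * math.cos(w * label + b) for w, b in zip(omegas, phases)]

    return phi


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(a: float, b: float, sigma: float) -> float:
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


# Hypothetical chord-level labels: one chord index per token, so tokens
# played over the same chord share a "position".
chord_labels = [0, 0, 0, 1, 1, 2, 2, 2]
sigma = 1.0
phi = make_rff_map(num_features=8192, sigma=sigma, seed=0)
features = [phi(float(c)) for c in chord_labels]

for m, n in [(0, 1), (0, 3), (0, 5)]:
    approx = dot(features[m], features[n])
    exact = gaussian_kernel(chord_labels[m], chord_labels[n], sigma)
    print(f"tokens {m},{n}: RFF {approx:.3f}  exact {exact:.3f}")
```

In this sketch the kernel depends only on the label difference. As a result, tokens under the same chord receive identical features, however far apart they are in time. The approximation error of the feature dot product shrinks roughly as 1/sqrt(num_features).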
Abstract
While music remains a challenging domain for generative models such as Transformers, recent progress has been made by exploiting suitable musically informed priors. One technique to leverage information about musical structure in Transformers is inserting such knowledge into the positional encoding (PE) module. However, standard Transformer attention has a cost that is quadratic in sequence length. In this paper, we propose F-StrIPE, a structure-informed PE scheme that works with linear complexity. Using existing kernel approximation techniques based on random features, we show that F-StrIPE is a generalization of Stochastic Positional Encoding (SPE). We illustrate the empirical merits of F-StrIPE on the task of melody harmonization for symbolic music.
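Linear complexity with random features usually comes from reordering the attention product. Once queries and keys are mapped to finite feature vectors, the N x N score matrix never has to be materialized. The following is a minimal sketch of this generic kernelized (linear) attention trick, not the authors' code. Several simplifications are assumptions: the feature vectors are random placeholders rather than structure-informed PE features, softmax normalization is omitted, and the function names are illustrative.

```python
import random


def quadratic_attention(phi_q, phi_k, values):
    """Reference: out[m] = sum_n (phi_q[m] . phi_k[n]) * values[n]; visits all N^2 pairs."""
    n_tokens, d_v = len(values), len(values[0])
    out = []
    for m in range(n_tokens):
        row = [0.0] * d_v
        for n in range(n_tokens):
            w = sum(a * b for a, b in zip(phi_q[m], phi_k[n]))
            for j in range(d_v):
                row[j] += w * values[n][j]
        out.append(row)
    return out


def linear_attention(phi_q, phi_k, values):
    """Same output, linear in N: S = sum_n phi_k[n] values[n]^T (D x d_v), then out[m] = phi_q[m]^T S."""
    dim, d_v = len(phi_k[0]), len(values[0])
    s = [[0.0] * d_v for _ in range(dim)]
    for k_feat, v in zip(phi_k, values):  # one pass over keys and values
        for i in range(dim):
            for j in range(d_v):
                s[i][j] += k_feat[i] * v[j]
    # one pass over queries
    return [[sum(q_feat[i] * s[i][j] for i in range(dim)) for j in range(d_v)] for q_feat in phi_q]


rng = random.Random(0)
N, D, D_V = 64, 16, 4  # sequence length, feature dimension, value dimension
phi_q = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
phi_k = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
values = [[rng.gauss(0, 1) for _ in range(D_V)] for _ in range(N)]

slow = quadratic_attention(phi_q, phi_k, values)
fast = linear_attention(phi_q, phi_k, values)
max_err = max(abs(a - b) for rs, rf in zip(slow, fast) for a, b in zip(rs, rf))
print(f"max abs difference: {max_err:.2e}")
```

Both functions compute the same quantity, but the second costs O(N D d_v) instead of O(N^2 (D + d_v)). A row-normalized variant would divide by phi_q[m] . sum_n phi_k[n], and that denominator can be accumulated with the same single pass over the keys.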
