Efficient Long-Sequence Diffusion Modeling for Symbolic Music Generation

Jinhan Xu, Xing Tang, Houpeng Yang, Haoran Zhang, Shenghua Yuan, Jiatao Chen, Tianming Xi, Jing Wang, Jiaojiao Yu, Guangli Xiang

TL;DR

Experiments show that SMDIM outperforms other state-of-the-art approaches in both generation quality and computational efficiency, and that it generalizes robustly to underexplored musical styles.

Abstract

Symbolic music generation is a challenging task in multimedia generation: it involves long sequences with hierarchical temporal structure, long-range dependencies, and fine-grained local detail. Although recent diffusion-based models produce high-quality generations, they incur high training and inference costs on long symbolic sequences because of iterative denoising and costs that grow with sequence length. To address this problem, we propose SMDIM, a diffusion strategy that combines efficient global structure construction with lightweight local refinement. SMDIM uses structured state space models to capture long-range musical context at near-linear cost and selectively refines local musical details through a hybrid refinement scheme. Experiments on a wide range of symbolic music datasets, spanning Western classical, popular, and traditional folk music, show that SMDIM outperforms other state-of-the-art approaches in both generation quality and computational efficiency, and that it generalizes robustly to underexplored musical styles. These results show that SMDIM offers a principled solution for long-sequence symbolic music generation, including the associated attributes that accompany the sequences. We provide a project webpage with audio examples and supplementary materials at https://3328702107.github.io/smdim-music/.
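To make the "near-linear cost" claim concrete, the sketch below compares rough operation counts for a quadratic self-attention layer and a linear-time state space scan as the sequence grows. It is a back-of-the-envelope illustration, not the authors' cost model: the function names, the constant factors, `d_model = 512`, and `state_dim = 16` are all assumptions chosen for the example.

```python
def attention_cost(seq_len: int, d_model: int) -> int:
    """Rough multiply-accumulate count for one self-attention layer.

    Counts only the two sequence-mixing products (Q @ K^T and weights @ V),
    each costing about seq_len^2 * d_model operations.
    """
    return 2 * seq_len * seq_len * d_model


def ssm_scan_cost(seq_len: int, d_model: int, state_dim: int = 16) -> int:
    """Rough multiply-accumulate count for a linear-time state space scan.

    Assumes each position performs one state update and one readout of a
    hidden state of size state_dim per channel (state_dim is hypothetical).
    """
    return 2 * seq_len * d_model * state_dim


if __name__ == "__main__":
    d_model = 512  # hypothetical model width
    for seq_len in (256, 1024, 4096):
        ratio = attention_cost(seq_len, d_model) / ssm_scan_cost(seq_len, d_model)
        print(f"L={seq_len:5d}  attention/SSM cost ratio = {ratio:.1f}")
```

Under these assumptions the ratio simplifies to `seq_len / state_dim`, so the gap widens linearly with sequence length. Real costs also depend on projections, feed-forward layers, and hardware, which this sketch ignores.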

Paper Structure

This paper contains 26 sections, 13 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: A score example illustrating the absorbing state diffusion process.
  • Figure 2: The diagram shows SMDIM on the left and its core component, the MFA Block, on the right. SMDIM processes input sequences hierarchically, while the MFA Block combines Mamba, FeedForward, and self-attention layers to balance scalability and precision. The [Mask] tokens represent noise, which denoising transforms into music symbols, yielding coherent musical sequences.
  • Figure 3: Ablation results on the GFLOPs of the SMDIM model vs. the SCHmUBERT model at different input sequence lengths. All GFLOPs are computed with the thop package for comparison.
  • Figure 4: Average Overlapping Area (OA) of music generated by SMDIM under different output sequence lengths.
  • Figure 5: Representative failure cases generated by the proposed model. (a)-(b) Failure cases exhibiting extreme pitch ranges and overly dense vertical note stacking, resulting in musically implausible symbolic structures. The regions highlighted in red and blue indicate representative local errors, including abnormal pitch outliers and excessive simultaneous note activations. (c) A failure case from the final portion of a generated sequence, illustrating structural degradation and weakened global musical organization over time.
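The absorbing state diffusion process named in Figure 1, together with the [Mask] tokens described in Figure 2, can be illustrated with a small toy example. The sketch below is not the authors' implementation. It assumes a linear masking schedule, uses a hypothetical `MASK` id, and replaces the denoising network with an oracle `predict` callback that returns the ground-truth token, so it only shows the mechanics of masking and gradual unmasking.

```python
import random

MASK = -1  # hypothetical id for the absorbing [Mask] token


def forward_mask(tokens, t, num_steps, rng):
    """Corrupt a sequence at step t: each token is independently replaced
    by MASK with probability t / num_steps (assumed linear schedule)."""
    p = t / num_steps
    return [MASK if rng.random() < p else tok for tok in tokens]


def reverse_step(noisy, predict, t, rng):
    """One reverse step at time t >= 1: each masked position is revealed
    with probability 1 / t using predict(i); revealed tokens stay fixed."""
    return [
        predict(i) if tok == MASK and rng.random() < 1.0 / t else tok
        for i, tok in enumerate(noisy)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    num_steps = 8
    melody = [60, 62, 64, 65, 67, 69, 71, 72]  # toy pitch tokens (assumed)

    # At t = num_steps every token has been absorbed into MASK.
    x = forward_mask(melody, num_steps, num_steps, rng)

    # Walk back from t = num_steps to t = 1, revealing tokens gradually.
    for t in range(num_steps, 0, -1):
        x = reverse_step(x, lambda i: melody[i], t, rng)
        print(t, x)
```

Because the last step uses probability 1 / 1, every remaining [Mask] token is revealed by the end of the loop. In a trained model, `predict` would sample from the network's output distribution over music symbols instead of looking up the original token.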