MusicMamba: A Dual-Feature Modeling Approach for Generating Chinese Traditional Music with Modal Precision

Jiatao Chen, Tianming Xie, Xing Tang, Jing Wang, Wenjing Dong, Bing Shi

TL;DR

This work tackles the challenge of generating Chinese traditional melodies with accurate modal expression. It introduces MusicMamba, a model built on a Dual-Feature Modeling Module that combines the Mamba Block’s long-range dependency modeling with the Transformer Block’s global-structure capture, together with a Bidirectional Mamba Fusion Layer that integrates local and global information. It extends the REMI representation to REMI-M by adding mode-related events and note-type markers that encode modes more explicitly, and it provides FolkDB, a high-quality Chinese traditional music dataset of over 11 hours. Generation is formulated as an autoregressive task p(y|x,M), decomposed into mode subsequences M_i and their transitions f(M_i), so that modal structure is handled explicitly. Experimental results show improved mode consistency and pitch entropy alignment, along with better subjective ratings for coherence, richness, and style. These results indicate that REMI-M combined with MusicMamba reproduces Chinese folk modal characteristics more accurately and can serve as a baseline for future research on modal structure. The approach has practical implications for culturally aware AI composition and for the study of modal structures across ethnic musical traditions.
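To make the "pitch entropy" criterion mentioned above more concrete, here is a minimal sketch of one common way such a metric is computed: the Shannon entropy of the 12-bin pitch-class histogram of a melody, compared between a generated and a reference sequence. This is an assumption about the general form of the metric, not the paper's exact definition; the function names `pitch_class_entropy` and `entropy_gap` and the example note lists are hypothetical.

```python
import math
from collections import Counter


def pitch_class_entropy(midi_pitches):
    """Shannon entropy (in bits) of the 12-bin pitch-class histogram.

    Assumed form of a pitch entropy metric; the paper may define it differently.
    """
    if not midi_pitches:
        raise ValueError("need at least one pitch")
    # Fold MIDI note numbers into pitch classes 0..11 and count occurrences.
    counts = Counter(p % 12 for p in midi_pitches)
    total = len(midi_pitches)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_gap(generated, reference):
    """Absolute entropy difference; a smaller gap means closer alignment."""
    return abs(pitch_class_entropy(generated) - pitch_class_entropy(reference))


# Hypothetical data: a pentatonic phrase on C versus a full chromatic run.
pentatonic = [60, 62, 64, 67, 69, 72, 69, 67, 64, 62]
chromatic = list(range(60, 72))

print(pitch_class_entropy(pentatonic))
print(pitch_class_entropy(chromatic))
print(entropy_gap(chromatic, pentatonic))
```

A melody that stays within a five-note mode uses fewer pitch classes than a chromatic one, so its entropy is lower; a generated sequence whose entropy is close to that of the source material is, under this assumed metric, better aligned with it.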

Abstract

In recent years, deep learning has significantly advanced the MIDI domain, solidifying music generation as a key application of artificial intelligence. However, existing research primarily focuses on Western music and encounters challenges in generating melodies for Chinese traditional music, especially in capturing modal characteristics and emotional expression. To address these issues, we propose a new architecture, the Dual-Feature Modeling Module, which integrates the long-range dependency modeling of the Mamba Block with the global structure capturing capabilities of the Transformer Block. Additionally, we introduce the Bidirectional Mamba Fusion Layer, which integrates local details and global structures through bidirectional scanning, enhancing the modeling of complex sequences. Building on this architecture, we propose the REMI-M representation, which more accurately captures and generates modal information in melodies. To support this research, we developed FolkDB, a high-quality Chinese traditional music dataset encompassing various styles and totaling over 11 hours of music. Experimental results demonstrate that the proposed architecture excels in generating melodies with Chinese traditional music characteristics, offering a new and effective solution for music generation.

Paper Structure

This paper contains 14 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The scores of various models for replicating Chinese folk music with the specific style metric.
  • Figure 2: The proportion of generated music sequences with modes using three encoding schemes: MIDI-Like, REMI, and REMI-M.
  • Figure 3: Illustration of the proposed MusicMamba model.
  • Figure 4: Mode distributions of the original MIDI sequence (Source) and sequences generated by MusicMamba and MT (MusicTransformer). The left panel illustrates the overlapping mode distributions, while the right panel presents the individual mode distributions for each sequence.