YNote: A Novel Music Notation for Fine-Tuning LLMs in Music Generation
Shao-Chien Lu, Chen-Chen Yeh, Hui-Lin Cho, Chun-Chieh Hsu, Tsai-Ling Hsu, Cheng-Han Wu, Timothy K. Shih, Yu-Cheng Lin
TL;DR
This work addresses the difficulty of fine-tuning LLMs for music generation with existing notations by introducing YNote, a fixed-format notation that describes each note with two components, pitch and duration, using two characters per component (four characters per note). After converting 190 Jiangnan-style tunes into YNote and fine-tuning GPT-2 (124M), the authors show that a simple prompt (e.g., the first bar, or the first and last notes) can yield coherent music, with BLEU = $0.883$ and ROUGE = $0.766$. The generated output needs little normalization (roughly $1.6\%$–$2.2\%$ edits), and training runs on a single RTX 4070 Ti. Overall, YNote is presented as a practical alternative to traditional music notations for ML applications, with potential to improve style-controlled music generation and learning efficiency.
Abstract
The field of music generation using Large Language Models (LLMs) is evolving rapidly, yet existing music notation systems, such as MIDI, ABC Notation, and MusicXML, remain too complex for effective fine-tuning of LLMs. Their variability and intricate structure make these formats difficult for both machines and humans to interpret. To address these challenges, we introduce YNote, a simplified music notation system that uses only four characters to represent a note's pitch and duration. YNote's fixed format ensures consistency, making it easy to read and better suited to fine-tuning LLMs. In our experiments, we fine-tuned GPT-2 (124M) on a YNote-encoded dataset and achieved BLEU and ROUGE scores of 0.883 and 0.766, respectively. With just two notes as a prompt, the model was able to generate coherent and stylistically relevant music. We believe YNote offers a practical alternative to existing music notations for machine learning applications and has the potential to significantly enhance the quality of music generation using LLMs.
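
To illustrate why a fixed four-character format is convenient for machine processing, the following is a minimal sketch of splitting a fixed-width note string into per-note fields. It is not the authors' implementation: the function name `split_notes`, the assumption that the first two characters carry pitch and the last two carry duration, and the symbols in the example string are hypothetical placeholders, not actual YNote symbols.

```python
from typing import List, Tuple

NOTE_WIDTH = 4   # four characters per note, as stated in the abstract
PITCH_WIDTH = 2  # assumption: first two characters are pitch, last two duration


def split_notes(encoded: str) -> List[Tuple[str, str]]:
    """Split a fixed-width note string into (pitch, duration) pairs.

    Because every note occupies exactly NOTE_WIDTH characters, no
    delimiter or grammar is needed; slicing by position is enough.
    """
    if len(encoded) % NOTE_WIDTH != 0:
        raise ValueError(
            f"length {len(encoded)} is not a multiple of {NOTE_WIDTH}"
        )
    notes = []
    for start in range(0, len(encoded), NOTE_WIDTH):
        chunk = encoded[start:start + NOTE_WIDTH]
        notes.append((chunk[:PITCH_WIDTH], chunk[PITCH_WIDTH:]))
    return notes


# Hypothetical placeholder symbols, for illustration only.
pairs = split_notes("C404D408")
```

A malformed sequence (one whose length is not a multiple of four) is rejected immediately, which is one way a fixed format could simplify validating model output.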
