ChatMusician: Understanding and Generating Music Intrinsically with LLM

Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, Ziyang Ma, Liumeng Xue, Ziyu Wang, Qin Liu, Tianyu Zheng, Yizhi Li, Yinghao Ma, Yiming Liang, Xiaowei Chi, Ruibo Liu, Zili Wang, Pengfei Li, Jingcheng Wu, Chenghua Lin, Qifeng Liu, Tao Jiang, Wenhao Huang, Wenhu Chen, Emmanouil Benetos, Jie Fu, Gus Xia, Roger Dannenberg, Wei Xue, Shiyin Kang, Yike Guo

TL;DR

ChatMusician introduces a text-based LLM trained to understand and generate music by treating ABC notation as a second language, using continual pretraining and LoRA-based finetuning on LLaMA2. It provides MusicPile, a large music-focused corpus, and MusicTheoryBench, a college-level symbolic-music benchmark to assess knowledge and reasoning. Results show improved musical knowledge and competitive generation quality, with zero-shot music reasoning outperforming GPT-4 in some cases while maintaining language capabilities. The work highlights the potential of LLMs as compressors and generators of structured symbolic music and releases the data, benchmark, and code for open collaboration.

Abstract

While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning of LLaMA2 on a text-compatible music representation, ABC notation, treating music as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing the model with musical abilities does not harm its language abilities; it even achieves a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be excellent compressors for music, but there remains significant territory to be conquered. We release our 4B-token music-language corpus MusicPile, the collected MusicTheoryBench, code, model, and demo on GitHub.
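
To make "text-compatible" concrete, the following is a minimal sketch of how an ABC score can be handled as an ordinary string. The tune is a hand-written example, and `split_abc` is a hypothetical helper introduced only for illustration; neither comes from MusicPile or the ChatMusician code. The sketch relies only on the ABC convention that header fields take the form `<letter>:<value>` and that the key field `K:` closes the header.

```python
# Hypothetical tune written for illustration; it is NOT taken from MusicPile.
# X = reference number, T = title, M = meter, L = unit note length, K = key.
ABC_TUNE = """X:1
T:Example Tune
M:4/4
L:1/8
K:D
|:d2 fd A2 FA|B2 gf e2 dc:|
"""


def split_abc(tune: str) -> tuple[dict[str, str], str]:
    """Separate ABC header fields (e.g. 'K:D') from the tune body.

    Illustrative helper (an assumption, not part of ChatMusician).
    """
    headers: dict[str, str] = {}
    body_lines: list[str] = []
    in_body = False
    for line in tune.strip().splitlines():
        # Header lines look like '<letter>:<value>'; the K: field ends the header.
        is_field = len(line) > 1 and line[0].isalpha() and line[1] == ":"
        if not in_body and is_field:
            headers[line[0]] = line[2:].strip()
            if line[0] == "K":
                in_body = True
        else:
            body_lines.append(line)
    return headers, "\n".join(body_lines)


headers, body = split_abc(ABC_TUNE)
print(headers)  # header fields as a plain dict
print(body)     # the notes themselves, still plain text
```

Because the whole score is a short string, it can be passed through a standard text tokenizer like any other text, which is the property the abstract relies on.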

Paper Structure

This paper contains 49 sections, 9 figures, 8 tables, 2 algorithms.

Figures (9)

  • Figure 1: ChatMusician learns from web-sourced musical knowledge and handcrafted music score generation instructions, unifies music generation and music understanding, and can chat, compose, and answer college-level music theory questions.
  • Figure 2: Commonly used music representations, including Wav, Codec, MIDI (visualized as piano roll), and ABC notation. From left to right, the compression rate gets higher.
  • Figure 3: We included diverse music scores from around the world in MusicPile. The distribution of a portion of music scores containing regional information has been marked with blue points on the world map.
  • Figure 4: Simple examples of (a) music knowledge and (b) music reasoning from MusicTheoryBench. Question (a) mainly involves concepts that can be answered from memory. Question (b) requires knowledge of descending, natural minor scale, and leading tone, as well as inference based on the musical score.
  • Figure 5: Zero-shot accuracy on MusicTheoryBench. We included GPT-3.5, GPT-4, LLaMA2-7B-Base, ChatMusician-Base, and ChatMusician. The blue bar represents the performance on the music knowledge metric, and the red bar represents the music reasoning metric. The dashed line corresponds to a random baseline, with a score of 25%.
  • ...and 4 more figures