Foundation Models for Music: A Survey

Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, Ruibin Yuan, Shangda Wu, Shih-Lun Wu, Shuqi Dai, Shun Lei, Shiyin Kang, Simon Dixon, Wenhu Chen, Wenhao Huang, Xingjian Du, Xingwei Qu, Xu Tan, Yizhi Li, Zeyue Tian, Zhiyong Wu, Zhizheng Wu, Ziyang Ma, Ziyu Wang

TL;DR

This survey maps the advent of foundation models in music. It details music representations across acoustic, symbolic, and multimodal modalities, and outlines how self-supervised pretraining and diffusion/transformer architectures enable music understanding, generation, and therapy applications. It categorises pretraining paradigms (contrastive, generative, masked modelling), discusses domain adaptation techniques, tokenisers, and model architectures, and highlights music agents and scaling laws. It also appraises datasets, evaluation protocols, and ethical considerations (copyright, transparency, bias, and personality rights), and identifies gaps such as long-sequence modelling and domain knowledge integration. The work seeks to guide researchers toward robust, diverse, and responsible development of music foundation models, with attention to cultural representation and societal impact.

Abstract

In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we find that many music representations are underexplored in FM development. We then emphasise the limited versatility of previous methods across diverse music applications, along with the potential of FMs in music understanding, generation, and medical applications. By comprehensively exploring the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies, and controllability, we highlight important topics that deserve deeper exploration, such as instruction tuning and in-context learning, scaling laws and emergent abilities, and long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of the datasets and evaluations essential for pre-training and downstream tasks. Finally, underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends in FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.

Paper Structure

This paper contains 134 sections, 17 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: The input modalities, downstream applications and social impacts of foundation models for music
  • Figure 2: Symbolic music representations for the same piece of music
  • Figure 3: Excerpt of Schubert's Impromptu Op. 90 No. 4 and its input visualisations
  • Figure 4: Comparison of various music audio self-supervised models evaluated on a range of different tasks, as reported through the MARBLE benchmark (yuan2024marble). Figure reprinted with permission.
  • Figure 5: A broad taxonomy of pre-training strategies for Music Foundation Models. We categorise these strategies into Contrastive Learning, Generative Pre-training, and Masked Modelling.
  • ...and 4 more figures