MusicEval: A Generative Music Dataset with Expert Ratings for Automatic Text-to-Music Evaluation

Cheng Liu, Hui Wang, Jinghua Zhao, Shiwan Zhao, Hui Bu, Xin Xu, Jiaming Zhou, Haoqin Sun, Yong Qin

TL;DR

This work tackles the challenge of evaluating text-to-music (TTM) systems by proposing an automatic evaluation framework aligned with human perception. It introduces MusicEval, a first-of-its-kind expert-scored dataset containing 2,748 music clips generated by 31 models in response to 384 prompts, with 13,740 ratings from 14 music experts, and demonstrates CLAP-based automatic scoring for two evaluation dimensions: musical impression and textual alignment. The dataset spans 16.62 hours of mono audio, diverse system types, and carefully designed prompts, accompanied by reliability measures for expert ratings. A CLAP-based baseline model is shown to predict these scores with strong correlations to human judgments, establishing a valuable reference for future automatic TTM evaluation research. The work offers practical impact by enabling scalable, human-aligned evaluation for TTM models and datasets, with potential to guide model development and benchmarking across generative music research.

Abstract

The technology for generating music from textual descriptions has seen rapid advancements. However, evaluating text-to-music (TTM) systems remains a significant challenge, primarily due to the difficulty of balancing performance and cost with existing objective and subjective evaluation methods. In this paper, we propose an automatic assessment task for TTM models to align with human perception. To address the TTM evaluation challenges posed by the professional requirements of music evaluation and the complexity of the relationship between text and music, we collect MusicEval, the first generative music assessment dataset. This dataset contains 2,748 music clips generated by 31 advanced and widely used models in response to 384 text prompts, along with 13,740 ratings from 14 music experts. Furthermore, we design a CLAP-based assessment model built on this dataset, and our experimental results validate the feasibility of the proposed task, providing a valuable reference for future development in TTM evaluation. The dataset is available at https://www.aishelltech.com/AISHELL_7A.
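The headline counts above imply a few derived figures that are useful as a sanity check. The following is a minimal sketch in Python that uses only the numbers quoted in the summary and abstract. The function name `dataset_summary` and the dictionary keys are illustrative assumptions, not part of the dataset release. The results are averages only: the text here does not say how ratings are split across the two scoring dimensions or whether every clip received the same number of ratings.

```python
# Figures quoted in the TL;DR and abstract above.
NUM_CLIPS = 2_748      # generated music clips
NUM_RATINGS = 13_740   # expert ratings
NUM_PROMPTS = 384      # text prompts
NUM_SYSTEMS = 31       # text-to-music models
TOTAL_HOURS = 16.62    # total audio duration


def dataset_summary():
    """Derive simple averages from the published counts (hypothetical helper)."""
    return {
        # Average number of ratings collected per clip.
        "ratings_per_clip": NUM_RATINGS / NUM_CLIPS,
        # Average clip duration in seconds.
        "avg_clip_seconds": TOTAL_HOURS * 3600 / NUM_CLIPS,
        # Average number of clips generated per prompt.
        "clips_per_prompt": NUM_CLIPS / NUM_PROMPTS,
        # Average number of clips contributed by each system.
        "clips_per_system": NUM_CLIPS / NUM_SYSTEMS,
    }


if __name__ == "__main__":
    for name, value in dataset_summary().items():
        print(f"{name}: {value:.2f}")
```

On these numbers, the average works out to exactly 5 ratings per clip and roughly 21.8 seconds of audio per clip.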

Paper Structure

This paper contains 17 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: The distribution of musical impression scores and textual alignment scores in the MusicEval dataset.
  • Figure 2: The distribution of the musical impression scores for each system in the MusicEval dataset.
  • Figure 3: The distribution of the textual alignment scores for each system in the MusicEval dataset.
  • Figure 4: Pie charts of system information across four dimensions: accessibility, commercialization, year, and model size.
  • Figure 5: The distributional analysis of prompt length.