The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models

Jiajia Li, Lu Yang, Mingni Tang, Cong Chen, Zuchao Li, Ping Wang, Hai Zhao

TL;DR

ZIQI-Eval introduces a large-scale, music-focused benchmark for evaluating LLMs, addressing a critical gap in assessing musical comprehension and generation. By curating over 14k entries across 10 categories and 56 subcategories, the framework enables systematic, bias-aware evaluation of 16 LLMs, including API-based and open-source models. Key findings show broad underperformance in musical tasks, with API models generally outperforming open-source counterparts, and notable biases across gender, race, and region. The work highlights GPT-4's relative strengths and persistent gaps, and it proposes future multimodal extensions to better capture the full scope of musical expertise in LLMs.

Abstract

Benchmarks play a pivotal role in assessing the advancements of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, there is a notable absence of a dedicated benchmark for assessing their musical abilities. To address this gap, we present ZIQI-Eval, a comprehensive and large-scale music benchmark specifically designed to evaluate the music-related capabilities of LLMs. ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries. Using ZIQI-Eval, we evaluate 16 LLMs and analyze their performance in the domain of music. Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities. With ZIQI-Eval, we aim to provide a standardized and robust evaluation framework that facilitates a comprehensive assessment of LLMs' music-related abilities. The dataset is available on GitHub (https://github.com/zcli-charlie/ZIQI-Eval) and HuggingFace (https://huggingface.co/datasets/MYTH-Lab/ZIQI-Eval).

Paper Structure

This paper contains 37 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: ZIQI-Eval task overview.
  • Figure 2: Examples of music comprehension and music generation test.
  • Figure 3: Performance of LLMs on gender, racial, and regional bias. Subfigure (a) shows the F1 score of each LLM for each type of bias; a line graph of each LLM's average F1 shows its overall bias condition. Subfigure (b) depicts the distribution of the biases. Top left: gender bias, top right: racial bias, bottom left: European region, bottom right: other regions.
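
The Figure 3 caption refers to per-LLM F1 scores and their average. As a rough illustration only, the sketch below computes the standard F1 (harmonic mean of precision and recall) for three bias categories and then averages them. The counts are invented for the example, and the function and variable names are hypothetical; this summary does not state how the paper derives its F1 values, so this should not be read as the authors' procedure.

```python
from statistics import mean


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives means precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (true positive, false positive, false negative) counts
# for one model; these numbers are NOT taken from the paper.
counts = {
    "gender": (8, 2, 2),
    "racial": (6, 2, 4),
    "region": (5, 5, 5),
}

# One F1 per bias category, then a single average for the model,
# mirroring the "average F1 of each LLM" mentioned in the caption.
per_category = {name: f1_score(*c) for name, c in counts.items()}
average_f1 = mean(per_category.values())

for name, score in per_category.items():
    print(f"{name}: {score:.3f}")
print(f"average: {average_f1:.3f}")
```

Averaging the per-category scores with equal weight is itself an assumption; a weighted average by category size would be an equally plausible reading of the caption.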