MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music

Zihao Wang; Shuyu Li; Tao Zhang; Qi Wang; Pengfei Yu; Jinyang Luo; Yan Liu; Ming Xi; Kejun Zhang

TL;DR

MuChin addresses the lack of benchmarks for evaluating how well multimodal LLMs understand and describe music in colloquial Chinese. It introduces CaiMAP, an annotation platform with a multi-person, multi-stage quality-assurance workflow, and CaiMD, the resulting dataset of high-precision annotations aligned with how the general public describes music. The benchmark covers a diverse task set, including textual description and lyric generation. The paper analyzes the discrepancies between professional and amateur descriptions, and benchmarks several generative LLMs and music-understanding models using novel lyric-structure and description-quality metrics. This work provides a practical framework and dataset to advance Chinese music-language understanding and supports targeted fine-tuning of LLMs for music-related tasks.

Abstract

The rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks to uniformly evaluate their performance on understanding and textually describing music. However, due to semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and low precision of annotations, existing music description datasets cannot serve as benchmarks. To this end, we present MuChin, the first open-source music description benchmark in Chinese colloquial language, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP) that employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and alignment with popular semantics. Utilizing this method, we built a dataset with multi-dimensional, high-precision music annotations, the Caichong Music Dataset (CaiMD), and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in terms of music description, and empirically demonstrated the effectiveness of annotated data for fine-tuning LLMs. Ultimately, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music. All data related to the benchmark, along with the scoring code and detailed appendices, have been open-sourced (https://github.com/CarlWangChina/MuChin/).

Paper Structure (46 sections, 3 equations, 16 figures, 7 tables, 1 algorithm)

Introduction
Establishment of MuChin Benchmark
Benchmark Tasks
Textual Description Task
Lyric Generation Task
Tasks with Automatic Annotation
Preparation and Settings
Data Preprocessing
Recruitment and Training of Individuals
Annotation and Assurance Pipeline
Screening & Structure Annotation Phase
Structure Quality Assurance Phase
Description Annotation Phase
Description Quality Assurance Phase
Admin Spot-Check & Settlement Phase
...and 31 more sections

Figures (16)

Figure 1: An overview of the MuChin benchmark. The Chinese Colloquial Descriptions consist of Description(A) and Common Description(P & A), annotated by amateur annotators. In addition, professional annotators label Description(P), Musical Sections, and Rhyming Structures of the lyrics. Machine-annotated information such as MIDI is also incorporated. Together, these allow MuChin to support a wider range of benchmark tasks.
Figure 2: Pipeline of data annotation and assurance. Each annotated entry passes through five phases to ensure accuracy. The figure shows actual screenshots of the pages for each phase. For software development and operation details, please refer to Appendix \ref{app:caimap}.
Figure 3: Semantic similarity scores between professionals and amateurs. For each selected type of music, we calculate the similarity between the two groups along various dimensions; the calculation method is described in Section \ref{sec:und}. Because a smaller value signifies a larger discrepancy, the results in this figure reveal significant gaps between the two groups across several specific dimensions.
Figure 4: Screenshot supplementing the main text: the 'Song Purpose' section during the Description Annotation Phase.
Figure 5: Screenshot supplementing the main text: the 'Song Purpose' section during the Description Quality Assurance Phase.
...and 11 more figures
