WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

Gagan Mundada, Yash Vishe, Amit Namburi, Xin Xu, Zachary Novack, Julian McAuley, Junda Wu

TL;DR

WildScore introduces the first in-the-wild benchmark for symbolic music reasoning by pairing real score images from public discourse with expert-generated, multiple-choice questions anchored in a structured musicology taxonomy. The approach frames complex symbolic reasoning as MCQ tasks to enable scalable, objective evaluation across visual and textual modalities, and provides an 807-item dataset with ground-truth preferences and difficulty stratification. Empirical results reveal that current vision-language systems show mixed performance, with notable strengths in surface-level recognition but persistent challenges in deep symbolic abstraction, rhythmic interpretation, and orchestration. By releasing the dataset, code, and a clear taxonomy, WildScore establishes a practical benchmark to guide future improvements in multimodal symbolic music understanding and analysis.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
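As a rough illustration of why the multiple-choice framing makes evaluation objective, the sketch below scores a model's chosen letters against ground-truth answers and reports accuracy per high-level category. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the record fields (`id`, `category`, `answer`), the sample items, and the function name `accuracy_by_category` are all hypothetical and do not reflect the released dataset schema. Only the category abbreviations (HT, RM, Tx) come from the paper.

```python
from collections import defaultdict

# Hypothetical benchmark items. The field names are assumptions made for
# illustration and are not the schema of the released dataset.
items = [
    {"id": "q1", "category": "HT", "answer": "B"},
    {"id": "q2", "category": "HT", "answer": "A"},
    {"id": "q3", "category": "RM", "answer": "C"},
    {"id": "q4", "category": "Tx", "answer": "D"},
]

# Hypothetical model outputs: question id -> chosen option letter.
# q4 is deliberately missing to show that an unanswered item counts as wrong.
predictions = {"q1": "B", "q2": "C", "q3": "C"}


def accuracy_by_category(items, predictions):
    """Return {category: fraction of items answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        category = item["category"]
        total[category] += 1
        # A missing prediction yields None, which never equals the answer.
        if predictions.get(item["id"]) == item["answer"]:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


print(accuracy_by_category(items, predictions))
```

Because each question has a single keyed answer, scoring reduces to exact matching on the option letter, so results can be aggregated by any taxonomy level without human judgment.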

Paper Structure

This paper contains 31 sections, 1 equation, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Example questions from our symbolic music benchmark dataset, illustrating the diversity of high-level categories and subcategories included. For each of the five core categories—Harmony & Tonality (HT), Rhythm & Meter (RM), Texture (Tx), Expression & Performance (EP), and Form (FM)—we present representative samples spanning their respective subcategories. Each panel shows a sample multiple-choice question along with corresponding answer choices, demonstrating the range and depth of musical concepts assessed in our benchmark.
  • Figure 2: Overview of the dataset construction pipeline, including Reddit post collection, music entity extraction, query generation, and candidate retrieval.
  • Figure 3: Distribution of symbolic music questions by high-level category. Category abbreviations: FM: Form, HT: Harmony & Tonality, RM: Rhythm & Meter, Tx: Texture, EP: Expression & Performance.
  • Figure 4: Distribution of symbolic music questions by subcategory. Subcategory abbreviations: PS: Phrase Structure, CF: Contrapuntal Forms, CP: Chord Progressions, MP: Modulation Patterns, MM: Modal Mixture, MS: Metric Structure, RP: Rhythmic Patterns, HTx: Homophonic Texture, PT: Polyphonic Texture, OT: Orchestral Texture, DA: Dynamics & Articulation, TI: Technique & Interpretation.
  • Figure 5: Per-subcategory QA accuracy by vision-enabled model.