U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga

TL;DR

U-MATH introduces a university-level, multimodal mathematics benchmark consisting of 1,100 open-ended problems across six subjects, with $\boldsymbol{\mu}$-MATH providing a meta-evaluation framework for judging free-form solutions. The study shows that current LLMs achieve limited problem-solving performance (roughly 63% on text and 45% on visuals) and that solution judging remains challenging, even for top models, with meta-evaluation macro $F1$ reaching around 80%. Dataset collection combines real coursework with expert validation, and the accompanying meta-evaluation reveals notable prompting- and model-dependent biases in judging. The work emphasizes the need for robust evaluation pipelines, tool augmentation, and advanced meta-evaluation to advance LLM mathematical reasoning in real-world, university-level contexts.

Abstract

The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks and an even lower 45% on visual problems. Solution assessment also proves challenging for LLMs, with the best LLM judge reaching an F1-score of 80% on $\mu$-MATH.

Paper Structure

This paper contains 34 sections, 32 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: U-MATH problems cover university-level topics and require multiple steps to solve. A random sample is provided; the reference solution is shortened. In this example, a common error is overlooking the non-negativity of time.
  • Figure 2: Performance of the selected top-performing models on U-MATH, U-MATH$_\text{Text}$ and U-MATH$_\text{Visual}$. Color denotes different model families; the 'visual' label highlights models with a visual encoder. Higher is better for all charts.
  • Figure 3: Relative differences between specific judgment performance (i.e., over samples with solutions generated by a specific author model) and integral judgment performance across all samples. Judgment performance is measured by the $\boldsymbol{\mu}$-MATH macro F1-scores. Each pane corresponds to a different author model considered when measuring specific performance. The x-axis specifies the judge corresponding to each bar pair; each pair compares the relative differences under the AutoCoT and CoT prompting schemes.
  • Figure 4: Example text-only and visual problems from the U-MATH benchmark, illustrating the topic, problem, and golden answer.
  • Figure 5: An example problem from the U-MATH benchmark, illustrating the problem, reference solution and golden answer.
  • ...and 11 more figures