U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga
TL;DR
U-MATH introduces a university-level, multimodal mathematics benchmark consisting of 1,100 open-ended problems across six subjects, with $\mu$-MATH providing a meta-evaluation dataset for assessing how well LLMs judge free-form solutions. The study shows that current LLMs achieve limited problem-solving performance (at most 63% accuracy on text-based tasks and 45% on visual problems) and that solution judging remains challenging even for top models, with the best judge reaching a macro F1-score of around 80%. Dataset collection combines real coursework with expert validation, and the accompanying meta-evaluation reveals notable prompting- and model-dependent biases in judging. The work emphasizes the need for robust evaluation pipelines, tool augmentation, and advanced meta-evaluation to advance LLM mathematical reasoning in real-world, university-level contexts.
Abstract
The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, and 20% of its problems are multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset for evaluating LLMs' capabilities in judging solutions. The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks and an even lower 45% on visual problems. Solution assessment also proves challenging for LLMs, with the best LLM judge achieving an F1-score of 80% on $\mu$-MATH.
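As a rough illustration of the judge metric mentioned above, the following minimal sketch computes a macro-averaged F1-score over binary correct/incorrect verdicts. It assumes each judged solution carries a gold label and a judge prediction; the function names and the toy data are hypothetical and are not taken from the benchmark's code or data.

```python
def f1_for_class(gold, pred, positive):
    """F1-score treating `positive` as the target class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    denom = 2 * tp + fp + fn
    # By convention here, a class with no support and no predictions scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_class(gold, pred, c) for c in labels) / len(labels)


# Hypothetical toy data: True means "solution is correct".
gold = [True, True, True, False, False, False]
pred = [True, True, False, False, False, True]

score = macro_f1(gold, pred)
print(f"macro F1 = {score:.3f}")
```

Macro averaging weights the "correct" and "incorrect" classes equally, so a judge that is biased toward one verdict is penalized even when that verdict dominates the data.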
