GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models
Yue Zhang, Jiaxin Zhang, Qiuyu Ren, Tahsin Saffat, Xiaoxuan Liu, Zitong Yang, Banghua Zhu, Yi Ma
TL;DR
GAUSS addresses a gap in mathematical reasoning benchmarks by shifting evaluation from topic-level accuracy to a fine-grained, skill-based profile of LLMs. It introduces a three-domain, twelve-skill taxonomy and annotates problems with cognitive tags to diagnose strengths and gaps in knowledge recall, theorem understanding, symbolic manipulation, problem-solving strategies, intuition, and meta-skills. The framework is demonstrated with a skill profile of GPT-5-thinking, which shows strong memory and evaluation abilities but weaknesses in theorem understanding, computational fluency, and generalization, along with gains over o4-mini-high in a few categories. The authors plan to broaden the problem pool, develop automatic grading pipelines, and build an open, extensible platform for community contributions, with the aim of enabling precise, interpretable tracking of mathematical progress in language models and guiding targeted improvements.
Abstract
We introduce GAUSS (General Assessment of Underlying Structured Skills in Mathematics), a benchmark that evaluates LLMs' mathematical abilities across twelve core skill dimensions, grouped into three domains: knowledge and understanding, problem solving and communication, and meta-skills and creativity. By categorizing problems according to cognitive skills and designing tasks that isolate specific abilities, GAUSS constructs comprehensive, fine-grained, and interpretable profiles of models' mathematical abilities. These profiles faithfully represent the models' underlying mathematical intelligence. To illustrate how the GAUSS benchmark can be used, we derive the skill profile of GPT-5-thinking, revealing its strengths and weaknesses as well as its differences from o4-mini-high, which underscores the value of multidimensional, skill-based evaluation.
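To make the idea of a skill profile concrete, the following minimal Python sketch aggregates graded, skill-tagged problems into one score per skill. It is an illustration under stated assumptions, not the authors' pipeline: the `GradedProblem` type, the `skill_profile` function, the averaging rule, and the toy problems and scores are all hypothetical. Only the skill names are taken from the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedProblem:
    """One graded problem (hypothetical record layout, not the GAUSS schema)."""
    problem_id: str
    skills: tuple[str, ...]  # skill tags attached to the problem
    score: float             # grade in [0, 1]


def skill_profile(results):
    """Average the score of every problem tagged with each skill.

    Assumption: a problem with several tags counts fully toward each of them.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for result in results:
        for skill in result.skills:
            totals[skill] += result.score
            counts[skill] += 1
    return {skill: totals[skill] / counts[skill] for skill in totals}


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    GradedProblem("p1", ("knowledge recall",), 1.0),
    GradedProblem("p2", ("theorem understanding", "symbolic manipulation"), 0.5),
    GradedProblem("p3", ("theorem understanding",), 0.0),
]

profile = skill_profile(toy_results)
for skill, value in sorted(profile.items()):
    print(f"{skill}: {value:.2f}")
```

Reporting a per-skill mean rather than a single overall accuracy keeps a weakness in one skill visible even when the aggregate score is high. This is the kind of contrast a multidimensional profile is meant to expose.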
