GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models

Yue Zhang, Jiaxin Zhang, Qiuyu Ren, Tahsin Saffat, Xiaoxuan Liu, Zitong Yang, Banghua Zhu, Yi Ma

TL;DR

GAUSS targets a fundamental gap in mathematical reasoning benchmarks by shifting evaluation from topic-level accuracy to a fine-grained, skill-based profile of LLMs. It introduces a three-domain, twelve-skill taxonomy and annotates problems with cognitive tags to diagnose strengths and gaps in knowledge recall, theorem understanding, symbolic manipulation, problem-solving strategies, intuition, and meta-skills. The framework is demonstrated via a skill profile for GPT-5-thinking, which reveals strong memory and evaluation abilities but weaknesses in theorem understanding, computational fluency, and generalization, with gains over o4-mini-high in a few categories. Going forward, the authors plan to broaden the problem pool, develop automatic grading pipelines, and create an open, extensible platform for community contributions. The aim is precise, interpretable tracking of mathematical progress in language models, which can in turn guide targeted improvements.

Abstract

We introduce GAUSS (General Assessment of Underlying Structured Skills in Mathematics), a benchmark that evaluates LLMs' mathematical abilities across twelve core skill dimensions, grouped into three domains: knowledge and understanding, problem solving and communication, and meta-skills and creativity. By categorizing problems according to cognitive skills and designing tasks that isolate specific abilities, GAUSS constructs comprehensive, fine-grained, and interpretable profiles of models' mathematical abilities. These profiles are intended to faithfully represent the models' underlying mathematical intelligence. To exemplify how to use the GAUSS benchmark, we have derived the skill profile of GPT-5-thinking, revealing its strengths and weaknesses as well as its differences relative to o4-mini-high, thereby underscoring the value of multidimensional, skill-based evaluation.
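As a rough illustration of what a skill-based profile means in practice, the sketch below averages graded scores per skill tag. It is not the authors' pipeline: the record format, the function name `skill_profile`, and the scores are hypothetical, and the three tags are simply phrases borrowed from the summary above rather than the benchmark's actual twelve skill labels.

```python
from collections import defaultdict

# Hypothetical graded records: (skill tag, score in [0, 1]).
# Tags reuse phrases from the summary; the scores are made-up
# illustration values, not results reported in the paper.
GRADED = [
    ("knowledge recall", 1.0),
    ("knowledge recall", 0.5),
    ("theorem understanding", 0.0),
    ("theorem understanding", 0.5),
    ("symbolic manipulation", 1.0),
]


def skill_profile(records):
    """Return the mean score for each skill tag."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for skill, score in records:
        totals[skill] += score
        counts[skill] += 1
    return {skill: totals[skill] / counts[skill] for skill in totals}


profile = skill_profile(GRADED)
for skill, mean in sorted(profile.items()):
    print(f"{skill}: {mean:.2f}")
```

A dictionary of per-skill means like this is the kind of data a radar chart would plot, one axis per skill. Unlike a single accuracy number, it keeps weak skills visible next to strong ones.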

Paper Structure

This paper contains 124 sections, 372 equations, 32 figures, 3 tables.

Figures (32)

  • Figure 1: Example LLM Math Skills Radar Chart
  • ...and 31 more figures

Theorems & Definitions (89)

  • proof : Response of GPT-5-thinking
  • proof : Standard Solution
  • proof : Response of GPT-5-thinking
  • proof : Standard Solution
  • proof : Response of GPT-5-thinking
  • proof : Standard Solution
  • proof : Response of GPT-5-thinking
  • proof : Standard Solution
  • proof : Response of GPT-5-thinking
  • proof : Standard Solution
  • ...and 79 more