Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach
Afrar Jahin, Arif Hassan Zidan, Wei Zhang, Yu Bao, Tianming Liu
TL;DR
This work presents a fine-grained, cross-model evaluation of mathematical reasoning across eight leading LLMs on the MATH, GSM8K, and MMLU benchmarks. Using zero-shot prompting and the DeepEval scoring framework, it measures Correctness, Clarity, and Reasoning to show how architectural choices and training paradigms affect performance. DeepSeek-R1 and o3-mini emerge as strong performers, while distilled DeepSeek variants show degraded performance. The results highlight a latency–accuracy trade-off: Gemini 2.0 Flash gives the fastest responses at solid accuracy, while other models achieve higher reasoning quality at the cost of longer runtimes. The discussion advocates hybrid reasoning approaches (e.g., combining Group Relative Policy Optimization (GRPO) with chain-of-thought (CoT) reasoning) and cautions that distillation can weaken symbolic capabilities, offering guidance for designing future LLMs capable of rigorous mathematical reasoning.
Abstract
With the rapid advancement of Artificial Intelligence (AI), Large Language Models (LLMs) have significantly impacted a wide array of domains, including healthcare, engineering, science, education, and mathematical reasoning. Among these, mathematical reasoning remains a particularly challenging capability, often requiring multi-step logic and abstract generalization. While prior work has explored LLM performance on reasoning tasks, comprehensive evaluations that span both depth and breadth across model families remain limited. In this study, we present a systematic evaluation of mathematical reasoning abilities across eight leading LLMs, including two recent DeepSeek models, using three independent benchmark datasets. Our analyses reveal several key findings: (1) DeepSeek-R1 performs competitively with o1 across most domains and achieves the highest accuracy on the MMLU Formal Logic benchmark; (2) distilled variants, such as DeepSeek-1.5B, exhibit substantial performance degradation; and (3) Gemini 2.0 Flash achieves the lowest response latency. Beyond quantitative metrics, we explore how architectural choices, training paradigms, and optimization strategies contribute to variation in reasoning performance. These findings provide new insights into the capabilities and limitations of current LLMs in mathematical domains, and offer guidance for the development of future models better aligned with rigorous reasoning demands.
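To make the evaluation setup concrete, the following is a minimal sketch of a zero-shot loop that records accuracy and mean response latency for a single model. It is an illustration only, not the authors' pipeline: the names `evaluate_zero_shot` and `toy_model` are hypothetical, the prompt wording is assumed, and exact string matching stands in for the DeepEval-based Correctness, Clarity, and Reasoning scores used in the study.

```python
import time
from typing import Callable, Dict, List, Tuple


def evaluate_zero_shot(
    model: Callable[[str], str],
    problems: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return accuracy and mean latency (seconds) over (question, answer) pairs.

    Assumption: correctness is judged by exact string match, which is a
    simplification of the rubric-based scoring described in the paper.
    """
    if not problems:
        raise ValueError("problems must contain at least one item")

    correct = 0
    total_latency = 0.0
    for question, reference in problems:
        # Zero-shot: the prompt contains no worked examples (wording is assumed).
        prompt = f"Solve the following problem. Give only the final answer.\n{question}"
        start = time.perf_counter()
        answer = model(prompt)
        total_latency += time.perf_counter() - start
        if answer.strip() == reference.strip():
            correct += 1

    n = len(problems)
    return {"accuracy": correct / n, "mean_latency_s": total_latency / n}


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; answers one question correctly.
    return "4" if "2 + 2" in prompt else "0"


problems = [("What is 2 + 2?", "4"), ("What is 3 * 5?", "15")]
result = evaluate_zero_shot(toy_model, problems)
print(result["accuracy"], result["mean_latency_s"])
```

Running the same loop with each model's client in place of `toy_model` yields one accuracy and latency pair per model, which is the kind of comparison behind the latency–accuracy trade-off reported here.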
