Table of Contents
Fetching ...

Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yuxin Li, Yumeng Wang, Yi R. Fung

TL;DR

RFMDataset presents a 200-problem natural-language proof benchmark designed to reveal failure modes in state-of-the-art large reasoning models. It couples an LLM-as-a-judge with human validation to classify proofs into a fine-grained taxonomy of over 10 failure types and reports pass@1 accuracy across 10 models. The findings show pervasive gaps in rigorous, single-step reasoning and a notable prevalence of hallucinations and incomplete proofs, with geometry and other domains proving particularly challenging; self-reflection prompts offer limited improvement. The work argues for formal-verification–driven training and domain-specific data as a path toward more reliable mathematical reasoning in LLMs.

Abstract

Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which shows fundamental limitations in current large reasoning models: 1) large reasoning models grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.

Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

TL;DR

RFMDataset presents a 200-problem natural-language proof benchmark designed to reveal failure modes in state-of-the-art large reasoning models. It couples an LLM-as-a-judge with human validation to classify proofs into a fine-grained taxonomy of over 10 failure types and reports pass@1 accuracy across 10 models. The findings show pervasive gaps in rigorous, single-step reasoning and a notable prevalence of hallucinations and incomplete proofs, with geometry and other domains proving particularly challenging; self-reflection prompts offer limited improvement. The work argues for formal-verification–driven training and domain-specific data as a path toward more reliable mathematical reasoning in LLMs.

Abstract

Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which shows fundamental limitations in current large reasoning models: 1) large reasoning models grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.

Paper Structure

This paper contains 50 sections, 61 equations, 30 figures, 9 tables.

Figures (30)

  • Figure 1: An example of the failure case of Gemini-2.5-pro-preview-0506 in RFMDataset. In this case, the judge LLM meticulously examined the proof and identified "Circular Reasoning" as the failure mode.
  • Figure 2: The knowledge distribution of RFMDataset. We collect diverse questions from various domains to ensure the comprehensiveness of our evaluation.
  • Figure 3: The construction of RFMDataset and evaluation pipeline. We first manually collect challenging and diverse mathematical proof problems. Subsequently, we use an LLM-as-a-judge approach to evaluate the performance of inference models on this data, categorizing errors into various failure modes.
  • Figure 4: (a) The proportion of each error pattern of each model. The Average column summarizes the average proportion across all models for each pattern. (b) Accuracy of each model on every knowledge domain. The Average column summarizes the average accuracy across all models for each domain.
  • Figure 5: Accuracy of different models on each difficulty level.
  • ...and 25 more figures