Systematic Diagnosis of Brittle Reasoning in Large Language Models

V. S. Raghu Parupudi

TL;DR

This work diagnoses brittle mathematical reasoning in LLMs beyond final-answer accuracy. A generator model produces structured, stepwise reasoning traces on GSM8K, a more capable analyst model diagnoses the failures, and unsupervised clustering identifies emergent reasoning modes. The result is a granular cognitive profile that separates robust procedural modes from brittle combinatorial and abstract modes, with statistical validation (Fisher's exact test, $p<0.05$) and an overall accuracy of $84.9\%$. The key contribution is a diagnostic framework that maps reasoning into distinct modes and quantifies their reliability, offering a data-driven roadmap for targeted improvements and cross-model comparison. Practically, this supports building more reliable mathematical reasoning in AI systems and guiding data-efficient fine-tuning toward specific brittle capabilities.

Abstract

A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. To address this, we propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points. Our method first generates structured, step-by-step reasoning from gpt-3.5-turbo on the GSM8K dataset. We then use a more capable analyst model, gpt-4o-mini, to categorize errors and, crucially, perform an unsupervised clustering of every reasoning sentence to identify emergent "reasoning modes." This analysis reveals a cognitive profile with a stark, nonhuman-like brittleness: while the model achieves near-perfect accuracy on procedural modes like sequential calculation, its performance on modes requiring combinatorial reasoning with restrictions plummets. By identifying and quantifying the reliability of these distinct reasoning skills, our work provides a more granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable future applications.
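As an illustration of how the reliability of one reasoning mode could be compared against another, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of mode versus correctness. The function name, the mode labels, and all counts are hypothetical and are not taken from the paper; the paper's own tooling and contingency tables may differ.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are two reasoning modes; columns are correct / incorrect counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell,
        # holding all row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts, for illustration only (NOT results from the paper):
# mode A (e.g. sequential calculation): 95 correct, 5 incorrect
# mode B (e.g. combinatorial reasoning): 12 correct, 18 incorrect
p_value = fisher_exact_two_sided(95, 5, 12, 18)
print(f"p = {p_value:.3g}; significant at 0.05: {p_value < 0.05}")
```

An exact test is a natural choice in this setting because some reasoning modes may contain only a handful of sentences, where large-sample approximations such as the chi-squared test are unreliable.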

Paper Structure

This paper contains 12 sections, 2 tables.