LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Arash Gholami Davoodi, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour

TL;DR

The paper introduces the Mathematical Topics Tree (MaTT) benchmark, a large, hierarchically organized collection of 1,958 multiple-choice questions across 12 mathematical topics, constructed from Wikipedia topic lists and canonical textbooks. It evaluates a range of LLMs (including GPT-4, ChatGPT, o1-mini, Llama3.1, and Mistral) and finds limited overall reasoning capabilities, with GPT-4 achieving only ~54% accuracy and little improvement from Chain-of-Thought prompting; performance also degrades notably when options are not provided. A detailed analysis of correct responses shows that only 53.3% of GPT-4 explanations are complete, indicating that many correct answers rely on non-reasoning strategies such as choice engineering, theorem use, circular reasoning, or memorization. The study highlights substantial topic- and subtopic-level variability, underscoring gaps in genuine mathematical reasoning and justifying the release of MaTT's code and data to spur further benchmarking and model improvements.

Abstract

Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning. However, despite these achievements, current evaluations are mostly limited to specific mathematical topics, and it remains unclear whether LLMs are genuinely engaging in reasoning. To address these gaps, we present the Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark that offers 1,958 questions across a wide array of mathematical subjects, each paired with a detailed hierarchical chain of topics. Upon assessing different LLMs using the MaTT benchmark, we find that the most advanced model, GPT-4, achieved a mere 54% accuracy in a multiple-choice scenario. Interestingly, even when employing Chain-of-Thought prompting, we observe mostly no notable improvement. Moreover, LLMs' accuracy dropped dramatically, by up to 24.2 percentage points, when the questions were presented without choices. Further detailed analysis of the LLMs' performance across a range of topics showed significant discrepancies even for closely related subtopics within the same general mathematical area. In an effort to pinpoint the reasons behind LLMs' performance, we conducted a manual evaluation of the completeness and correctness of the explanations generated by GPT-4 when choices were available. Surprisingly, we find that in only 53.3% of the instances where the model provided a correct answer were the accompanying explanations deemed complete and accurate, i.e., the model engaged in genuine reasoning.

Paper Structure

This paper contains 23 sections, 1 equation, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Overview of Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark that presents questions spanning a diverse range of mathematical subjects, each associated with a detailed hierarchical structure of topics.
  • Figure 2: Per-topic breakdown for pure Math.
  • Figure 3: Per-topic breakdown for applied Math.
  • Figure 4: Per-topic breakdown for topics under Mathematics/Pure. o1-mini outperforms Llama3.1 in the majority of subtopics (35 out of 38), while Llama3.1 outperforms o1-mini in the remaining 3.
  • Figure 5: Per-topic breakdown for topics under Mathematics/Applied. o1-mini outperforms Llama3.1 in the majority of subtopics (44 out of 47), while Llama3.1 outperforms o1-mini in the remaining 3.