Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Ankit Satpute; Noah Giessing; Andre Greiner-Petter; Moritz Schubotz; Olaf Teschke; Akiko Aizawa; Bela Gipp

TL;DR

This work evaluates large language models on open-ended mathematical questions drawn from Math Stack Exchange through a two-stage approach: generating answers with leading LLMs and performing a case study on GPT-4 to assess accuracy, complemented by retrieval-based analyses using ArqMATH. GPT-4 achieves the best $nDCG$ and $P@10$ among math-tuned models with $nDCG = 0.486$ and $P@10 = 0.374$, and it even surpasses the ArqMATH Task1 baseline in $P@10$. The study also analyzes retrieval-augmented setups (DPR embeddings) and embedding-based question–answer matching, finding that GPT-4 can improve retrieval in some cases but fails to consistently provide fully accurate solutions for complex math questions. Through detailed case analyses, the authors highlight limitations in current AI-driven mathematical reasoning and emphasize the need for verification and retrieval-enhanced strategies, sharing code and data to foster further progress.

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopt a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from Math Stack Exchange (MSE). Second, we conduct a case analysis on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research: https://github.com/gipplab/LLM-Investig-MathStackExchange
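
The two metrics reported above, nDCG and P@10, can be illustrated with a minimal sketch of their standard textbook definitions. This is not the evaluation code used in the study: the function names, the relevance threshold, the linear gain, and the example grades are all assumptions made for illustration, and the exact metric variant used in the ArqMATH evaluation is not specified here.

```python
import math
from typing import Optional, Sequence


def precision_at_k(ranked: Sequence[int], k: int = 10, threshold: int = 1) -> float:
    """Fraction of the top-k results whose relevance grade is >= threshold.

    `ranked` holds graded relevance labels in ranked order (hypothetical grades).
    Missing positions (fewer than k results) count as non-relevant.
    """
    return sum(1 for r in ranked[:k] if r >= threshold) / k


def dcg(grades: Sequence[int], k: Optional[int] = None) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    top = grades if k is None else grades[:k]
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(top))


def ndcg(ranked: Sequence[int], judged: Sequence[int], k: Optional[int] = None) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering.

    `judged` holds the grades of all judged answers for the question,
    from which the ideal (descending) ordering is built.
    """
    ideal = dcg(sorted(judged, reverse=True), k)
    return dcg(ranked, k) / ideal if ideal > 0 else 0.0


# Hypothetical grades for one question's top-10 retrieved answers.
example = [3, 0, 2, 0, 0, 1, 0, 0, 0, 0]
p10 = precision_at_k(example)
score = ndcg(example, judged=example)
```

In this sketch, a system-level score would be the mean of the per-question values over all evaluated questions.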

Paper Structure (14 sections, 1 equation, 4 figures, 2 tables)

Introduction
Related work
Dataset
Methodology
Evaluation
Answer generation
Question-Answer comparison
Case study
GPT-4
https://math.stackexchange.com/questions/4022815/what-does-this-bracket-notation-mean/1958336#1958336: Retrieval Boost
https://math.stackexchange.com/questions/4155217/suppose-that-all-the-tangent-lines-of-a-regular-plane-curve-pass-through-some-fi?noredirect=1&lq=1: Retrieval Worsened
Tora-7b-Code
https://math.stackexchange.com/questions/4212480/number-of-solutions-of-equation-over-a-finite-field?noredirect=1&lq=1: Tora-7b-Code Boosts Retrieval
Conclusion

Figures (4)

Figure 1: Frequency of differences in P@10 values of DPR and GPT-4 (P@10$^{GPT-4}$ - P@10$^{DPR}$).
Figure 2: Question which is correctly answered by GPT-4.
Figure 3: Question which is incorrectly answered by GPT-4. Retrieval worsens because the generated answer is irrelevant to the question.
Figure 4: Answer generated by ToRA where it boosts precision.
