
MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning

Debrup Das, Debopriyo Banerjee, Somak Aditya, Ashish Kulkarni

TL;DR

MathSensei introduces a modular, tool-augmented LLM framework for mathematical reasoning that integrates knowledge retrieval, program generation and execution, and symbolic computation. Through extensive ablations across the MATH, AQuA-RAT, GSM-8K, and MMLU-Math datasets, the authors show that the best performance comes from combining the Wolfram-Alpha module (WA) with Python program generation and execution (PG), followed by a final solution-generation (SG) step; this configuration outperforms standard Chain-of-Thought (CoT) prompting and several baselines. The study gives guidance on when each tool helps (e.g., WA for algebraic problems, Bing Web Search (BS) for knowledge retrieval, PG for complex computations) and shows that planning strategies such as Plan-And-Solve or REACT offer limited gains compared to carefully configured tool pipelines. Overall, MathSensei highlights the potential and limits of tool-augmented LLMs (TALMs) in mathematics, pointing to domain-aware tool selection and math-specific planning approaches as ways to further boost performance. The work contributes a comprehensive experimental framework and actionable insights for designing future mathematical TALMs.

Abstract

Tool-augmented Large Language Models (TALMs) are known to enhance the skillset of large language models (LLMs), thereby improving their reasoning abilities across many tasks. While TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving, are open research questions. In this work, we present MathSensei, a tool-augmented large language model for mathematical reasoning. We study the complementary benefits of three tools, a knowledge retriever (Bing Web Search), a program generator and executor (Python), and a symbolic equation solver (Wolfram-Alpha API), through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning across diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on model performance. MathSensei achieves 13.5% better accuracy than gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and that the benefit increases as the complexity and required knowledge increase (progressively over AQuA, MMLU-Math, and higher-level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.
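
To make the idea of a configurable tool pipeline concrete, here is a minimal Python sketch of sequential module composition. It is not the authors' implementation: the names `run_pipeline`, `wa_stub`, `pg_stub`, `sg_stub`, and `REGISTRY` are hypothetical, the modules are offline stubs that call neither an LLM nor the Wolfram-Alpha or Bing APIs, and "SG" is taken here to mean a final solution-generation step. Each module reads a shared context and adds its own output, so an ablation amounts to changing the module sequence.

```python
from typing import Callable, Dict, List

# Shared state passed along the pipeline: module name -> text output.
Context = Dict[str, str]
Module = Callable[[Context], str]


def wa_stub(ctx: Context) -> str:
    """Hypothetical stand-in for a symbolic-solver query (no network call)."""
    return f"WA result for: {ctx['question']}"


def pg_stub(ctx: Context) -> str:
    """Hypothetical stand-in for program generation plus execution."""
    # Assumption: a real module would ask an LLM for this program.
    program = "result = sum(range(1, 11))"
    scope: Dict[str, object] = {}
    exec(program, {}, scope)  # run the "generated" program
    return f"Python output: {scope['result']}"


def sg_stub(ctx: Context) -> str:
    """Hypothetical final step: combine whatever earlier modules produced."""
    evidence = [ctx[name] for name in ("WA", "PG") if name in ctx]
    return "Final answer based on: " + " | ".join(evidence)


def run_pipeline(question: str, sequence: List[str],
                 registry: Dict[str, Module]) -> Context:
    """Run the named modules in order, storing each output in the context."""
    ctx: Context = {"question": question}
    for name in sequence:
        ctx[name] = registry[name](ctx)
    return ctx


REGISTRY: Dict[str, Module] = {"WA": wa_stub, "PG": pg_stub, "SG": sg_stub}

full = run_pipeline("What is 1 + 2 + ... + 10?", ["WA", "PG", "SG"], REGISTRY)
ablated = run_pipeline("What is 1 + 2 + ... + 10?", ["PG", "SG"], REGISTRY)
print(full["SG"])
print(ablated["SG"])
```

Keeping the sequence as plain data is one way to compare tool configurations (for example, with and without WA) without touching the modules themselves.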

Paper Structure

This paper contains 31 sections, 2 equations, 6 figures, 23 tables.

Figures (6)

  • Figure 1: An end-to-end workflow of MathSensei on the compositional setting from the MATH dataset. The final answer is highlighted in green font.
  • Figure 2: Overview of the BS module. We concatenate the similar questions and concepts, and the result is then used by a downstream module.
  • Figure 3: Overview of the WA module.
  • Figure 4: Overview of (a) the Python Generator module and (b) the Code Refiner module.
  • Figure 5: Generated output for an example from the MATH dataset in the REACT planning setting.
  • ...and 1 more figure