MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning
Debrup Das, Debopriyo Banerjee, Somak Aditya, Ashish Kulkarni
TL;DR
MathSensei introduces a modular, tool-augmented LLM framework for mathematical reasoning that integrates knowledge retrieval, program generation and execution, and symbolic computation. Through extensive ablations across the MATH, AQUA-RAT, GSM-8K, and MMLU-Math datasets, the authors demonstrate that the best performance arises from combining Wolfram-Alpha (WA) with Python-based program generation (PG), followed by a final solution generator (SG) step; this configuration outperforms standard Chain-of-Thought (CoT) prompting and several baselines. The study provides nuanced guidance on when each tool helps (e.g., WA for algebraic problems, Bing Web Search (BS) for knowledge retrieval, PG for complex computations) and shows that planning strategies such as Plan-And-Solve or ReAct offer limited gains compared to carefully configured tool pipelines. Overall, MathSensei highlights the potential and limits of tool-augmented LLMs (TALMs) in mathematics, pointing to domain-aware tool selection and math-specific planning approaches as ways to further boost performance. The work contributes a comprehensive experimental framework and actionable insights for designing future mathematical TALMs.
Abstract
Tool-augmented Large Language Models (TALMs) are known to enhance the skillset of large language models (LLMs), thereby improving their reasoning abilities across many tasks. While TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving, are open research questions. In this work, we present MathSensei, a tool-augmented large language model for mathematical reasoning. We study the complementary benefits of three tools: a knowledge retriever (Bing Web Search), a program generator and executor (Python), and a symbolic equation solver (Wolfram-Alpha API), through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning across diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on model performance. MathSensei achieves 13.5% better accuracy than gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and that the benefit increases as the complexity and required knowledge increase (progressively over AQuA, MMLU-Math, and the higher-level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.
