MathDivide: Improved mathematical reasoning by large language models
Saksham Sahai Srivastava, Ashutosh Gandhi
TL;DR
The paper introduces MathDivide, a prompting framework for improving mathematical reasoning in large language models. MathDivide decomposes a problem $\mathcal{M}$ into subproblems $\mathcal{M}_i$, forms an algebraic expression $e_i$ for each, and uses LLM-generated Python code $p_i^c$ to evaluate the expressions on the numeric inputs, aggregating the results into a final answer $s$. If $s \neq z$, where $z$ is the correct answer, a human-guided refinement prompt $p_r$ is used to correct the approach, enabling iterative improvement. Experiments on 250 GSM8K problems across GPT-3.5-turbo, GPT-4, Llama2-7B, and Llama3-8B show that MathDivide outperforms the MathPrompter baseline, with refinement loops boosting accuracy. The work demonstrates the value of structured problem decomposition and external computation for math tasks, discusses limitations such as dataset size and reliance on manual code execution, and proposes directions such as automated refinement and real-time learning.
Abstract
Large language models have proven capable of handling complex linguistic and cognitive tasks. Their use has therefore been extended to tasks that require logical reasoning, such as mathematics. In this paper, we propose a prompting technique called MathDivide that breaks a mathematical problem down into simpler subproblems. Each subproblem is formulated as an algebraic expression, which is evaluated by Python code that the LLM generates for that expression. The values fed to the Python code are the numerical values provided in the problem statement. The solutions to the subproblems are composed to obtain the final answer to the problem. The final answer is then compared with the correct answer: if the two match, it is produced as output; otherwise, a refinement prompt is fed to the LLM. We evaluate this prompting technique on both closed-source and open-source LLMs using the GSM8K dataset. The results show that MathDivide significantly outperforms the leading prompting technique, MathPrompter.
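The decompose, evaluate, compare, and refine loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `math_divide`, `evaluate_subproblems`, and `mock_decompose` are hypothetical, the LLM is replaced by a hard-coded stub, the word problem is invented for illustration rather than taken from GSM8K, and a fixed feedback string stands in for the human-guided refinement prompt.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A subproblem is (name, algebraic expression written as Python source).
Subproblem = Tuple[str, str]


def evaluate_subproblems(subproblems: List[Subproblem],
                         inputs: Dict[str, float]) -> Dict[str, float]:
    """Evaluate each expression in order; later ones may reuse earlier results."""
    values = dict(inputs)
    for name, expr in subproblems:
        # Illustration only: eval() stands in for running LLM-generated code.
        # Builtins are disabled, but this is not a safe sandbox.
        values[name] = eval(expr, {"__builtins__": {}}, values)
    return values


def math_divide(decompose: Callable[[str, Optional[str]], List[Subproblem]],
                problem: str,
                inputs: Dict[str, float],
                correct_answer: float,
                max_refinements: int = 2) -> Tuple[float, int]:
    """Return (final answer, number of refinement prompts that were issued)."""
    feedback: Optional[str] = None
    answer = float("nan")
    for attempt in range(max_refinements + 1):
        subproblems = decompose(problem, feedback)
        values = evaluate_subproblems(subproblems, inputs)
        answer = values[subproblems[-1][0]]  # last subproblem composes the rest
        if answer == correct_answer:
            return answer, attempt
        # Assumed stand-in for the refinement prompt fed back to the LLM.
        feedback = f"The answer {answer} is incorrect; revise the subproblems."
    return answer, max_refinements


def mock_decompose(problem: str, feedback: Optional[str]) -> List[Subproblem]:
    """Hypothetical LLM stub: wrong on the first try, corrected after feedback."""
    cake_expr = "cakes + cake_price" if feedback is None else "cakes * cake_price"
    return [
        ("bread_revenue", "loaves * loaf_price"),
        ("cake_revenue", cake_expr),
        ("total", "bread_revenue + cake_revenue"),
    ]


# Invented example problem (not from GSM8K).
problem = "A baker sells 12 loaves at $3 each and 5 cakes at $8 each. Total revenue?"
inputs = {"loaves": 12, "loaf_price": 3, "cakes": 5, "cake_price": 8}
answer, refinements = math_divide(mock_decompose, problem, inputs, correct_answer=76)
print(answer, refinements)
```

In this sketch the first decomposition adds the cake count to the cake price and yields 49, so one refinement is issued before the correct total of 76 is reached. Replacing `mock_decompose` with a real model call would require a prompt format and output parser, neither of which is specified here.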
