Multi-tool Integration Application for Math Reasoning Using Large Language Model
Zhihua Duan, Jialin Wang
TL;DR
The paper addresses the challenge of robust mathematical reasoning by proposing a tool-augmented framework in which a large language model interacts with external tools (a Math Tool, a Code Tool, and a CoT Tool) together with a Self Consistency Tool during inference. The approach orchestrates calculation, code execution, iterative reasoning, and consensus-based answer selection to improve reliability and accuracy on math tasks. On NumGLUE Task 4, the method reaches an accuracy of 89.09 and outperforms the GPT-3 few-shot and fine-tuned baselines, illustrating the synergistic effect of multi-tool collaboration. The architecture is reusable and can be extended with additional tools, potentially improving reasoning in domains beyond arithmetic problem solving.
Abstract
Mathematical reasoning is an important research direction in artificial intelligence. This article proposes a novel multi-tool application framework for mathematical reasoning that combines large language models (LLMs) with multiple external tools to achieve more comprehensive and accurate results. First, a Math Tool performs basic mathematical calculations during inference through interaction with the LLM. Second, a Code Tool generates syntactically valid code fragments and executes them, providing support for complex mathematical problems. Third, the CoT Tool applies iterative reasoning to improve the logical coherence and accuracy of the solution. Finally, a Self Consistency Tool selects the final answer from results obtained under different parameters, improving the consistency and reliability of reasoning. Through the synergistic effect of these tools, the framework achieves a significant performance improvement on mathematical reasoning tasks. We conducted experiments on the NumGLUE Task 4 test set, which contains 220 fill-in-the-blank mathematical reasoning questions. Based on the Math Tool, Code Tool, and CoT Tool, our method achieved an accuracy of 89.09 on Task 4. Compared with the GPT3+FewShot baseline, Few Shot+ERNIE-4.0+self consistency improved by 49.09%, and compared with the fine-tuning baseline, it improved by 52.29%.
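The answer-selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the names `math_tool`, `normalize`, and `self_consistency` are hypothetical, the Math Tool is approximated by a small arithmetic evaluator, and self consistency is approximated by a plain majority vote over candidate answers that would, in the framework, come from the Math Tool, Code Tool, and CoT Tool under different parameters.

```python
import ast
import operator
from collections import Counter

# Hypothetical stand-in for a Math Tool: supports +, -, *, /, ** and unary minus.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def math_tool(expression: str):
    """Evaluate a basic arithmetic expression without calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


def normalize(answer) -> str:
    """Map numerically equal answers (e.g. 8, "8", 8.0) to the same key."""
    try:
        return str(round(float(answer), 6))
    except (TypeError, ValueError):
        return str(answer).strip()


def self_consistency(candidates) -> str:
    """Return the most frequent normalized answer (assumed majority vote)."""
    votes = Counter(normalize(c) for c in candidates)
    return votes.most_common(1)[0][0]


# Candidate answers as they might arrive from different tools or samples.
candidates = [math_tool("(3 + 5) * 2 - 8"), "8", 8.0, "9"]
print(self_consistency(candidates))  # prints 8.0
```

In this sketch, ties are resolved in favor of the answer that appears first, which follows from how `Counter.most_common` orders equal counts. A real system would need an explicit tie-breaking policy and a sandbox for any code the Code Tool executes.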
