Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples, Verification and Dynamic Feedback

Zhongtao Miao; Kaiyan Zhao; Yoshimasa Tsuruoka

TL;DR

This work improves arithmetic reasoning in large language models with ART, a framework that represents reasoning steps as semi-structured relation tuples and pairs them with a local Python-based verifier and a dynamic feedback loop. Each reasoning step is linked to a relation tuple $(r_i,t_i)$ and validated through Python code $C_i$, which is executed by a local interpreter to produce a verification result $\hat{A_i^v}$ that is compared with the initial answer $\hat{A_i}$. Across seven arithmetic datasets and multiple LLMs, ART outperforms natural-language CoT, code-based PAL, and ModelSelection baselines, with notable gains on SVAMP and GSM8K, and it remains compatible with Self-Consistency. The method yields readable, machine-verifiable reasoning and a lightweight, model-agnostic verification pathway that can be integrated into existing prompting pipelines.
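The verify-then-feedback loop summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask_model`, `run_verifier`, `solve_with_feedback`, the `max_rounds` cap, the wording of the feedback message, and the convention that the verification code binds its result to a variable named `answer` are all hypothetical. Note that `exec` is not a sandbox, so real model-generated code would need stronger isolation.

```python
import math
from typing import Callable, Optional, Tuple

# (initial answer from the reasoning steps, verification code as a string)
Attempt = Tuple[float, str]


def run_verifier(code: str) -> float:
    """Run verification code locally and return the value it binds to `answer`.

    Assumption: the generated program stores its result in `answer`.
    """
    namespace: dict = {}
    exec(code, namespace)  # plain exec; NOT a security boundary
    return float(namespace["answer"])


def solve_with_feedback(
    ask_model: Callable[[Optional[str]], Attempt],
    max_rounds: int = 3,  # hypothetical retry cap
) -> float:
    """Query the model, check its answer against the program, retry on mismatch."""
    feedback: Optional[str] = None
    answer = float("nan")
    for _ in range(max_rounds):
        answer, code = ask_model(feedback)
        verified = run_verifier(code)
        if math.isclose(answer, verified, abs_tol=1e-6):
            return answer  # reasoning and program agree
        # Hypothetical feedback text; the paper uses its own feedback prompt.
        feedback = (
            f"The reasoning gave {answer} but the program gave {verified}; "
            "please reason again."
        )
    # Assumption: fall back to the last answer if no round is consistent.
    return answer
```

Here `ask_model` stands in for one LLM call that returns both the initial answer and the verification program; it receives `None` on the first round and the feedback string afterwards.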

Abstract

Current representations used in reasoning steps of large language models can mostly be categorized into two main types: (1) natural language, which is difficult to verify; and (2) non-natural language, usually programming code, which is difficult for people who are unfamiliar with coding to read. In this paper, we propose to use a semi-structured form to represent reasoning steps of large language models. Specifically, we use relation tuples, which are not only human-readable but also machine-friendly and easier to verify than natural language. We implement a framework that includes three main components: (1) introducing relation tuples into the reasoning steps of large language models; (2) implementing an automatic verification process of reasoning steps with a local code interpreter based on relation tuples; and (3) integrating a simple and effective dynamic feedback mechanism, which we found helpful for self-improvement of large language models. The experimental results on various arithmetic datasets demonstrate the effectiveness of our method in improving the arithmetic reasoning ability of large language models. The source code is available at https://github.com/gpgg/art.
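To make the "human-readable but machine-friendly" claim concrete, the sketch below shows one way a chain of relation tuples could be printed for a reader and also turned into executable Python. The `(name, relation, expression)` layout, the `RelationTuple` class, and the helpers `to_python` and `evaluate` are assumptions made for illustration; the exact tuple format is defined by the paper's prompts. In the framework as described, the verification program is produced by the language model, so the deterministic translation here is only a stand-in.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RelationTuple:
    """One semi-structured reasoning step (hypothetical layout)."""

    name: str        # quantity being defined, a valid Python identifier
    relation: str    # relation label, here always "is"
    expression: str  # a number or arithmetic over earlier names

    def readable(self) -> str:
        """Render the step the way a human reader would see it."""
        return f"({self.name}, {self.relation}, {self.expression})"


def to_python(steps: List[RelationTuple]) -> str:
    """Translate each tuple into one assignment statement."""
    return "\n".join(f"{s.name} = {s.expression}" for s in steps)


def evaluate(steps: List[RelationTuple]) -> float:
    """Execute the translated program and return the last quantity defined."""
    namespace: dict = {}
    exec(to_python(steps), namespace)  # plain exec; not a sandbox
    return float(namespace[steps[-1].name])


# Illustrative steps for a made-up word problem.
steps = [
    RelationTuple("eggs_per_day", "is", "16"),
    RelationTuple("eggs_left", "is", "eggs_per_day - 3 - 4"),
    RelationTuple("earnings", "is", "eggs_left * 2"),
]
for step in steps:
    print(step.readable())
print(evaluate(steps))
```

The same list of steps serves both audiences: `readable()` gives the tuple form a non-programmer can follow, while `to_python()` gives a program whose result can be compared against the answer stated in the reasoning.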

Paper Structure (30 sections, 5 equations, 25 figures, 8 tables)

Introduction
Method
Problem Formulation
ART Framework
Step 1: Reasoning with relation tuples.
Step 2: Automatic verification with relation tuples and a local code interpreter.
Step 3: Checking consistency and providing dynamic feedback when necessary.
Experiments
Setup
Datasets.
Models.
In-context Learning.
Implementation.
Main Results
Analysis and Discussion
...and 15 more sections

Figures (25)

Figure 1: Schematic overview of our framework, ART. "Q" denotes a question, "NL" means "Natural Language", and "RT" means "Relation Tuple". The left sub-figure shows our proposed framework ART without Self-Consistency (Wang et al., 2023). The right sub-figure shows that our framework can be integrated seamlessly with Self-Consistency.
Figure 2: A detailed example illustrating how our method works. This example shows the solution to the first question of the test split of the GSM8K dataset, generated by our framework using ChatGPT.
Figure 3: Prompt of relation tuple reasoning in Step 1.
Figure 4: Prompt of program verification in Step 2.
Figure 5: Feedback prompt when ART needs feedback.
...and 20 more figures
