System-2 Mathematical Reasoning via Enriched Instruction Tuning

Huanqia Cai, Yijun Yang, Zhifeng Li

TL;DR

This work introduces Enriched Instruction Tuning (EIT) to improve system-2 mathematical reasoning in LLMs by enriching existing mathematical datasets with fine-grained reasoning trajectories in two steps: Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step (ERS). By combining human and AI feedback, EIT creates a high-quality training set (EITMath) and fine-tunes open-source LLMs (e.g., LLaMA-2) without external tools, reaching $32.5\%$ accuracy on MATH and $84.1\%$ on GSM8K. The results indicate that more granular reasoning data, combined with larger data scales, yields better performance, and that EIT can match tool-augmented methods on these mathematical benchmarks. These findings highlight the importance of data quality and reasoning trajectory design for scaling mathematical reasoning in LLMs.

Abstract

Solving complex mathematical problems via system-2 reasoning is a natural human skill, yet it remains a significant challenge for current large language models (LLMs). We identify the scarcity of deliberate multi-step reasoning data as a primary limiting factor. To this end, we introduce Enriched Instruction Tuning (EIT), a method that enriches existing human-annotated mathematical datasets by synergizing human and AI feedback to create fine-grained reasoning trajectories. These datasets are then used to fine-tune open-source LLMs, enhancing their mathematical reasoning abilities without reliance on any symbolic verification program. Concretely, EIT is composed of two critical steps: Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step (ERS). ERP generates a high-level plan that breaks down complex instructions into a sequence of simpler objectives, while ERS fills in reasoning contexts often overlooked by human annotators, creating a smoother reasoning trajectory for LLM fine-tuning. Unlike existing CoT prompting methods, which generate reasoning chains relying solely on the LLM's internal knowledge, our method leverages human-annotated initial answers as "meta-knowledge" to help LLMs generate more detailed and precise reasoning processes, leading to a more trustworthy LLM expert for complex mathematical problems. In experiments, EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state-of-the-art fine-tuning and prompting methods, and even matching the performance of tool-augmented methods.
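The two enrichment steps described in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the prompt wording, the function names (`build_erp_prompt`, `build_ers_prompt`, `enrich_example`), and the `llm` callable are hypothetical, and a stub stands in for the privileged LLM so the example runs offline.

```python
from typing import Callable, Dict


def build_erp_prompt(question: str, human_answer: str) -> str:
    # Hypothetical ERP prompt: ask for a high-level plan, with the
    # human-annotated answer supplied as guidance ("meta-knowledge").
    return (
        "Task: write a high-level plan that breaks the problem into simpler objectives.\n"
        f"Problem: {question}\n"
        f"Reference answer (human-annotated): {human_answer}\n"
    )


def build_ers_prompt(question: str, human_answer: str, plan: str) -> str:
    # Hypothetical ERS prompt: ask for detailed steps that fill in the
    # reasoning the human annotator skipped, following the plan.
    return (
        "Task: expand the reference answer into detailed steps, filling in skipped reasoning.\n"
        f"Problem: {question}\n"
        f"Plan: {plan}\n"
        f"Reference answer (human-annotated): {human_answer}\n"
    )


def enrich_example(
    question: str, human_answer: str, llm: Callable[[str], str]
) -> Dict[str, str]:
    # ERP first, then ERS conditioned on the generated plan.
    plan = llm(build_erp_prompt(question, human_answer))
    steps = llm(build_ers_prompt(question, human_answer, plan))
    # The enriched response (plan + fine-grained steps) becomes the
    # fine-tuning target for the original instruction.
    return {"instruction": question, "response": plan + "\n\n" + steps}


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns canned text per task type.
    if prompt.startswith("Task: write a high-level plan"):
        return "Plan: (1) set up the equation; (2) solve for x."
    return "Step 1: 2x + 3 = 7, so 2x = 4. Step 2: x = 2."


sample = enrich_example("Solve 2x + 3 = 7.", "x = 2", stub_llm)
print(sample["response"])
```

Passing the model as a plain callable keeps the sketch independent of any particular API; in practice the stub would be replaced by a call to whichever privileged LLM is used for enrichment.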

Paper Structure

This paper contains 32 sections, 11 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Comparison of LLM responses generated by four different methods on a randomly selected problem from the MATH dataset. (a) MetaMath considers factorization, but its problem-solving plan is wrong and ignores other odd factors. (b) Chain-of-thought prompting (CoT) generates a valid plan, but hallucinations occur during the solution process, leading to a trivial calculation error: if $a$ equals 1 and $k$ is 599.5, then $b$ is 1200, which is not an odd integer. (c) The LLM fine-tuned by our Enriched Instruction Tuning (EIT) first generates a high-level plan (blue parts) decomposing the original question into a sequence of lower-level objectives and then produces a fine-grained reasoning trajectory (purple parts) with guidance from human-provided initial answers.
  • Figure 2: Pipeline of Enriched Instruction Tuning (EIT), which first leverages a privileged LLM, e.g., GPT-4, to produce enriched reasoning steps for the existing mathematical instruction dataset through our proposed ERP and ERS prompting methods, and then trains an LLM on this enriched dataset via instruction tuning. Note that the Original Response provided by human annotators omits the bold context in the Enriched Response, which is nonetheless critical to problem-solving.
  • Figure 3: left: Performance on MATH scales up as our EITMath dataset is added for LLM fine-tuning. Different colored bars represent MetaMathQA combined with Rejection sampling Fine-Tuning (RFT) and EITMath as the training set. middle: Performance on MATH scales up as more fine-grained reasoning steps are created for fine-tuning. We use the average number of tokens in the response (more is better) to measure its granularity. right: Perplexity and accuracy of different methods on GSM8K. Following MetaMath, the perplexity is calculated using an under-finetuned LLaMA-2-7B, and the accuracy is reported based on our LLaMA-2-70B model fine-tuned on these datasets. Our EITMath has lower perplexity than the other mathematical datasets, which leads to better performance.
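
For readers unfamiliar with the perplexity metric in the right panel of Figure 3, the sketch below shows the standard definition: the exponential of the average negative log-likelihood per token. It assumes the per-token natural-log probabilities have already been obtained from a language model; the function name and the example values are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    # token_log_probs: natural-log probabilities the model assigned to
    # each token of a response (assumed to be precomputed elsewhere).
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.5 to every token has perplexity 2.
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half))
```

Lower perplexity means the model finds the text less surprising, which is how the caption ties dataset perplexity to ease of learning.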