
Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning

Kuofeng Gao, Huanqia Cai, Qingyao Shuai, Dihong Gong, Zhifeng Li

TL;DR

This work introduces Chain of Self-Correction (CoSC), a mechanism that embeds multi-round self-correction into LLMs for mathematical reasoning by having the model generate, run, verify, and iteratively refine a program to solve problems. It implements CoSC via a two-phase fine-tuning strategy: (i) foundational learning with GPT-4-seeded trajectories and (ii) self-enhancement with model-generated data, retaining only correct trajectories. Across MATH and GSM8K, CoSC, especially the CoSC-Code family, achieves state-of-the-art open-source performance and even surpasses several proprietary models on challenging datasets in zero-shot settings. The results demonstrate that embedding self-correction as an intrinsic capability enables weaker LLMs to achieve strong mathematical reasoning, offering a cost-effective and scalable path to robust reasoning in AI systems.

Abstract

Accurate mathematical reasoning with Large Language Models (LLMs) is crucial in revolutionizing domains that heavily rely on such reasoning. However, LLMs often encounter difficulties in certain aspects of mathematical reasoning, leading to flawed reasoning and erroneous results. To mitigate these issues, we introduce a novel mechanism, the Chain of Self-Correction (CoSC), specifically designed to embed self-correction as an inherent ability in LLMs, enabling them to validate and rectify their own results. The CoSC mechanism operates through a sequence of self-correction stages. In each stage, the LLMs generate a program to address a given problem, execute this program using program-based tools to obtain an output, and then verify this output. Based on the verification, the LLMs either proceed to the next correction stage or finalize the answer. This iterative self-correction process allows the LLMs to refine their reasoning steps and improve the accuracy of their mathematical reasoning. We implement CoSC using a two-phase fine-tuning approach. First, LLMs are trained with a relatively small volume of seeding data generated from GPT-4. Then, we enhance CoSC by training with a larger volume of self-generated data, without relying on GPT-4. Experiments show that CoSC significantly boosts performance on standard mathematical datasets compared to existing open-source LLMs. Notably, our CoSC-Code-34B model achieved a 53.5% score on the challenging MATH dataset, outperforming models like ChatGPT, GPT-4, and multi-modal LLMs such as GPT-4V and Gemini-1.0. Importantly, CoSC operates in a zero-shot manner without requiring demonstrations.
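The stage loop described in the abstract (generate a program, execute it, verify the output, then either finalize or start another stage) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `Stage`, `run_program`, `chain_of_self_correction`, `toy_generate`, and `toy_verify` are hypothetical, the `generate` and `verify` callables stand in for calls to the fine-tuned LLM, and the stage cap `max_stages` is an assumed parameter.

```python
import contextlib
import io
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    program: str    # program generated in this stage
    output: str     # what the program printed (or an error message)
    verified: bool  # result of the verification step


def run_program(program: str) -> str:
    """Run a Python program and return what it printed, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program, {})  # illustration only; sandbox untrusted code in practice
    except Exception as exc:
        return f"error: {exc!r}"
    return buffer.getvalue().strip()


def chain_of_self_correction(
    question: str,
    generate: Callable[[str, List[Stage]], str],
    verify: Callable[[str, str, str], bool],
    max_stages: int = 3,  # assumed cap, not taken from the paper
) -> List[Stage]:
    """Repeat generate -> execute -> verify until verification passes."""
    stages: List[Stage] = []
    for _ in range(max_stages):
        program = generate(question, stages)   # model sees earlier stages
        output = run_program(program)
        ok = verify(question, program, output)
        stages.append(Stage(program, output, ok))
        if ok:  # verification passed: finalize the answer
            break
    return stages


# Toy stand-ins for the model: the first attempt is wrong, the second is right.
def toy_generate(question: str, history: List[Stage]) -> str:
    return "print(3 + 4)" if not history else "print(3 * 4)"


def toy_verify(question: str, program: str, output: str) -> bool:
    return output == "12"


if __name__ == "__main__":
    for stage in chain_of_self_correction("What is 3 * 4?", toy_generate, toy_verify):
        print(stage)
```

In this sketch the loop returns every stage, so a caller can inspect the rejected attempts as well as the final one. If no stage passes verification within `max_stages`, the last entry is simply left unverified.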

Paper Structure

This paper contains 34 sections, 4 equations, 2 figures, 7 tables, 1 algorithm.

Figures (2)

  • Figure 1: Comparison of four reasoning frameworks for solving an example mathematical question. (a) Chain of Thoughts (CoT) (Wei et al., 2022). (b) Program of Thoughts (PoT) (Chen et al., 2022). (c) ToRA (Gou et al., 2023), which combines CoT, PoT, and tool use. (d) Our proposed CoSC, which consists of a sequence of self-correction stages (two stages are shown in this example). Each stage has four sub-stages: (p1) the LLM generates a program for the question; (o1) the program is executed to obtain its output; (v1) a two-step verification checks that both the generated program and the program output are consistent with the question; (c1) depending on the verification result, the LLM either concludes with a refined answer or continues to the next self-correction stage. The final answer is extracted from the last conclusion sub-stage with regular expression matching.
  • Figure 2: The training of Chain of Self-Correction (CoSC) consists of two phases. The first phase, (a) CoSC Foundational Learning, trains LLMs with seeding data generated from proprietary models, equipping them with a baseline proficiency in the CoSC methodology. In particular, we prompt GPT-4 with training questions from the MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021) datasets to generate mathematical reasoning trajectories that adhere to the CoSC protocol. The second phase, (b) CoSC Self Enhancement, further adapts the seed model obtained from the previous phase with self-generated trajectories. These trajectories are produced by the seed model trained in the foundational phase, thereby enabling the generation of a substantial volume of data without the need for additional GPT-4 intervention. In both phases, we only retain trajectories whose answers match the ground-truth label.
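
The two captions mention regular-expression answer extraction and keeping only trajectories whose answers match the ground truth. A rough sketch of how those two steps might fit together is given below. It rests on assumptions that the text above does not confirm: that the final answer appears in a `\boxed{...}` marker, that answers are compared as stripped strings, and that trajectories are stored as `(question_id, text)` pairs. The names `extract_answer` and `keep_correct` are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed answer format: the last \boxed{...} in the conclusion text.
# This simple pattern does not handle nested braces such as \boxed{\frac{1}{2}}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(conclusion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} marker, or None if absent."""
    matches = BOXED.findall(conclusion)
    return matches[-1].strip() if matches else None


def keep_correct(
    trajectories: List[Tuple[str, str]],
    ground_truth: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Keep trajectories whose extracted answer equals the ground-truth label."""
    kept = []
    for question_id, text in trajectories:
        answer = extract_answer(text)
        # Exact string match is an assumption; real pipelines often normalize
        # numbers and expressions before comparing.
        if answer is not None and answer == ground_truth.get(question_id, "").strip():
            kept.append((question_id, text))
    return kept


if __name__ == "__main__":
    sampled = [
        ("q1", "The program output is 12, so the answer is \\boxed{12}."),
        ("q1", "The program output is 7, so the answer is \\boxed{7}."),
        ("q2", "The verification failed and no answer was produced."),
    ]
    labels = {"q1": "12", "q2": "5"}
    print(keep_correct(sampled, labels))
```

Under these assumptions the same filter would apply in both training phases, with only the source of the sampled trajectories changing (GPT-4 in the first phase, the seed model in the second).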