Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing

Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Chak Tou Leong, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li

TL;DR

RISE addresses inner-step subtle errors in mathematical reasoning by injecting controlled errors into reasoning steps and training with a subtle-error-aware Direct Preference Optimization objective that uses both self-edited and full-solution pairs. The approach avoids additional annotation while delivering robust improvements across mathematical datasets and general reasoning tasks, including logical reasoning and code generation, on multiple LLMs. By targeting error tokens and stabilizing training with an adaptive NLL term, RISE yields consistent gains on GSM8K, MATH, and Odyssey-MATH and demonstrates broader generalization beyond pure math. This work offers a practical pathway to reduce subtle reasoning mistakes in LLMs and broadens the applicability of preference-based training to cross-domain reasoning tasks.

Abstract

Large Language Models (LLMs) have exhibited strong mathematical reasoning prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle yet critical errors, such as miscalculations or incorrect substitutions, limit the LLMs' full potential. Existing studies to improve mathematical ability typically involve applying preference learning to step-wise solution pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook critical subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. In detail, RISE uses the LLM itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples. Moreover, the effect of error mitigation extends from mathematical reasoning to logical reasoning and code generation.
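
To make the training signal concrete, the following is a minimal sketch of the standard DPO loss for one preference pair, plus an NLL regularizer on the preferred solution. It is not the paper's subtle-error-aware objective, whose exact form this section does not give. In particular, the fixed weight `alpha`, the length normalization, and the function names `dpo_loss` and `dpo_with_nll` are assumptions made for illustration; the adaptive weighting and error-token targeting that RISE describes are not implemented here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given sequence-level log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen
    # solution over the rejected one, measured relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form for both signs.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def dpo_with_nll(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp,
                 chosen_len, beta=0.1, alpha=0.05):
    """DPO loss plus a length-normalized NLL term on the chosen solution.

    `alpha` is a fixed, hypothetical weight; the paper describes an
    adaptive term, which is not reproduced here.
    """
    nll = -policy_chosen_logp / chosen_len
    return dpo_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta) + alpha * nll


if __name__ == "__main__":
    # The policy already prefers the chosen solution slightly more than the
    # reference does, so the loss falls below log(2).
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and reference agree on both solutions, the margin is zero and the loss equals log 2; the loss falls as the policy separates the correct solution from the error-injected one.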

Paper Structure

This paper contains 35 sections, 3 equations, 6 figures, 31 tables, 1 algorithm.

Figures (6)

  • Figure 1: Error distribution for results of Qwen2-7B.
  • Figure 2: Preference learning framework augmented by error-injected self-editing. Each mathematical problem is sent to the original model to sample $K$ solutions; correct and incorrect solutions are shown in rectangles with blue and red borders, respectively. For one correct solution, we inject errors into each step of the solution and collect self-edited pairs. We also pair an incorrect solution with that correct solution to form a full-solution pair. Both sampling and self-editing are performed by the same model.
  • Figure 3: Error-injected self-editing prompt and some error injection examples. We display three error-injected self-editing operations: "REPLACE", "SWAP", and "DELETE".
  • Figure 4: Error analysis across three Qwen2-7B-based models. We display the number of errors of each type on the MATH dataset, where "Others" denotes errors that fall outside the scope of consideration.
  • Figure 5: Effect of different numbers of self-edited pairs. "All" indicates the use of all self-edited pairs.
  • ...and 1 more figure
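
The pair construction described in the captions of Figures 2 and 3 can be sketched as follows. This is an illustrative approximation under stated assumptions: in RISE the edits are produced by prompting the model itself, whereas `toy_edit` below is a hypothetical rule-based stand-in for the "REPLACE", "SWAP", and "DELETE" operations. The dictionary layout (`prompt`, `chosen`, `rejected`) and the function name `build_pairs` are also assumptions, not the paper's data format.

```python
def toy_edit(step, op, index=0):
    """Hypothetical rule-based stand-in for the LLM self-editing prompt."""
    tokens = step.split()
    if op == "REPLACE":
        # Perturb the first integer token by one, mimicking a miscalculation.
        for k, tok in enumerate(tokens):
            if tok.isdigit():
                tokens[k] = str(int(tok) + 1)
                break
    elif op == "SWAP":
        # Swap two adjacent tokens, mimicking an incorrect substitution order.
        if index + 1 < len(tokens):
            tokens[index], tokens[index + 1] = tokens[index + 1], tokens[index]
    elif op == "DELETE":
        # Drop one token, mimicking an omitted term.
        if index < len(tokens):
            del tokens[index]
    else:
        raise ValueError(f"unknown edit operation: {op}")
    return " ".join(tokens)


def build_pairs(problem, sampled, edit_fn=toy_edit, op="REPLACE"):
    """Build self-edited and full-solution pairs from sampled solutions.

    `sampled` is a list of (steps, is_correct) tuples, where `steps` is a
    list of strings, one per reasoning step.
    """
    correct = [steps for steps, ok in sampled if ok]
    incorrect = [steps for steps, ok in sampled if not ok]
    if not correct:
        return [], []
    chosen = correct[0]

    # Self-edited pairs: one per step of the correct solution, conditioned on
    # the problem and the preceding correct steps.
    self_edited = []
    for i, step in enumerate(chosen):
        edited = edit_fn(step, op)
        if edited != step:  # skip steps the edit left unchanged
            self_edited.append({
                "prompt": "\n".join([problem] + chosen[:i]),
                "chosen": step,
                "rejected": edited,
            })

    # Full-solution pair: the same correct solution against a sampled
    # incorrect one.
    full = []
    if incorrect:
        full.append({
            "prompt": problem,
            "chosen": "\n".join(chosen),
            "rejected": "\n".join(incorrect[0]),
        })
    return self_edited, full


if __name__ == "__main__":
    samples = [
        (["3 + 4 = 7", "7 * 2 = 14"], True),
        (["3 + 4 = 8", "8 * 2 = 16"], False),
    ]
    edited_pairs, full_pairs = build_pairs("Compute (3 + 4) * 2.", samples)
    for pair in edited_pairs + full_pairs:
        print(pair)
```

Each self-edited pair differs from its correct counterpart in only a few tokens, which is what makes it a hard pair compared with a sampled full-solution pair.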