Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing

Kaishuai Xu; Tiezheng Yu; Wenjun Hou; Yi Cheng; Chak Tou Leong; Liangyou Li; Xin Jiang; Lifeng Shang; Qun Liu; Wenjie Li

TL;DR

RISE addresses inner-step subtle errors in mathematical reasoning by injecting controlled errors into reasoning steps and training with a subtle-error-aware Direct Preference Optimization objective that uses both self-edited and full-solution pairs. The approach avoids additional annotation while delivering robust improvements across mathematical datasets and general reasoning tasks, including logical reasoning and code generation, on multiple LLMs. By targeting error tokens and stabilizing training with an adaptive NLL term, RISE yields consistent gains on GSM8K, MATH, and Odyssey-MATH and demonstrates broader generalization beyond pure math. This work offers a practical pathway to reduce subtle reasoning mistakes in LLMs and broadens the applicability of preference-based training to cross-domain reasoning tasks.

Abstract

Large Language Models (LLMs) have exhibited strong mathematical reasoning prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle yet critical errors, such as miscalculations or incorrect substitutions, limit the LLMs' full potential. Existing studies to improve mathematical ability typically involve applying preference learning to step-wise solution pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook critical subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. In detail, RISE uses the LLM itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples. Moreover, the effect of error mitigation extends from mathematical reasoning to logical reasoning and code generation.
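
The training signal described above can be illustrated with a minimal sketch, assuming the standard DPO loss on sequence-level log-probabilities plus a length-normalized NLL term on the chosen solution. The function names, the fixed weight `alpha`, and the default `beta` are illustrative assumptions; the abstract does not specify RISE's exact subtle-error-aware objective or how its NLL term adapts, so this is not the authors' implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full solution under
    the policy or the frozen reference model. In the setting described
    above, the rejected solution is either a self-edited solution with an
    injected error or an incorrect sampled solution.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def dpo_with_nll(policy_chosen: float, policy_rejected: float,
                 ref_chosen: float, ref_rejected: float,
                 num_chosen_tokens: int,
                 beta: float = 0.1, alpha: float = 0.0) -> float:
    """DPO loss plus a weighted NLL term on the chosen solution.

    Assumption: alpha is a fixed, hypothetical weight; the source only
    says the NLL term is adaptive and does not give the schedule.
    """
    nll = -policy_chosen / num_chosen_tokens  # length-normalized NLL
    preference = dpo_loss(policy_chosen, policy_rejected,
                          ref_chosen, ref_rejected, beta)
    return preference + alpha * nll
```

When the policy matches the reference on both solutions, the margin is zero and the preference term equals log 2; the loss falls as the policy assigns relatively more probability to the correct solution than to the edited or incorrect one.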
