Surgical Post-Training: Cutting Errors, Keeping Knowledge

Wenye Lin; Kai Han

TL;DR

This paper introduces Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. It pairs a data rectification pipeline with a reward-based binary cross-entropy objective that treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals.

Abstract

Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in the reward estimate of Direct Preference Optimization (DPO). This motivates Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
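
To illustrate the contrast the abstract draws between relative ranking and binary classification, the following minimal Python sketch compares the standard DPO loss with one plausible form of a reward-based binary cross-entropy loss. The implicit reward r = beta * (log pi(y|x) - log pi_ref(y|x)) is the standard DPO reward estimate. The exact SPoT objective is not stated in this summary, so `reward_bce_loss`, the other function names, and the value of `beta` are assumptions made for illustration only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style reward estimate: beta times the policy/reference log-ratio.

    beta = 0.1 is an assumed value, not one taken from the paper.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard DPO loss: depends only on the margin r_chosen - r_rejected."""
    return -log_sigmoid(r_chosen - r_rejected)


def reward_bce_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed binary cross-entropy form: each reward is classified on its own.

    The chosen response is labeled 1 and the rejected response 0, so the two
    supervision signals are decoupled rather than tied together by a margin.
    """
    return -log_sigmoid(r_chosen) - log_sigmoid(-r_rejected)


# Both rewards have dropped, but the rejected one dropped further.
r_chosen, r_rejected = -2.0, -5.0
print(f"DPO loss: {dpo_loss(r_chosen, r_rejected):.4f}")
print(f"BCE loss: {reward_bce_loss(r_chosen, r_rejected):.4f}")
```

In this example the margin between the two rewards is large, so the ranking loss is already close to zero even though the reward of the chosen response has fallen below its starting point. The classification form still penalizes that case, because it asks for a positive reward on the chosen response independently of the rejected one.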

Paper Structure (43 sections, 12 equations, 9 figures, 3 tables)

Introduction
Data Rectification Pipeline
Error Elicitation
Oracle-Guided Surgical Rectification
LCS Filtering
The Reward Is Secretly a Regularizer
Empirical Observation: The Regularization Gap
SFT+
DPO
Reward-SFT
Gradient Analysis: The Mechanics of Tethering
The Surgical Optimization Objective
The "Pull-Up" Effect
The Insufficiency of Relative Ranking
Reward-based Binary Cross Entropy Optimization
...and 28 more sections

Figures (9)

Figure 1: Illustration of SPoT. Left: Our framework utilizes an Oracle to apply surgical rectifications to erroneous reasoning steps, generating positive samples that remain proximal to the model’s original distribution. Top-Right: Unlike the relative ranking used in DPO, we leverage an explicit classification loss that proves more effective for reasoning tasks. Bottom-Right: The tethering effect inherent in our reward definition effectively mitigates catastrophic forgetting.
Figure 2: IFEval accuracy (avg@5). SFT+ forgets its instruction-following ability, while reward-based methods do not.
Figure 3: Training loss curve. Reward-SFT and DPO converge rapidly to near-zero as the policy satisfies the relative margin constraint. SFT+ remains high as it attempts to maximize absolute likelihood, driving continuous parameter updates.
Figure 4: Evolution of implicit rewards during training. Left: reward scores for chosen responses; Right: reward scores for rejected responses. The reward shows how much the model's preference for the selected response has increased compared to its initial stage.
Figure 5: Distributions of change ratio. A higher change ratio indicates that reasoning failures occur earlier in the reasoning chain. The shape of the distribution is jointly determined by model ability and data difficulty.
...and 4 more figures
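
The change ratio in Figure 5 and the LCS Filtering step in the outline are not defined in this summary. As a rough illustration only, the sketch below assumes the change ratio is the fraction of reasoning steps not shared, via the longest common subsequence (LCS), between the original response and its rectified version. The function names and this definition are hypothetical and may differ from the paper's.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def change_ratio(original: List[str], rectified: List[str]) -> float:
    """Assumed definition: 1 - LCS / max(len), so 0.0 means nothing changed."""
    longest = max(len(original), len(rectified))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(original, rectified) / longest


original = ["s1", "s2", "s3", "s4"]
late_fix = ["s1", "s2", "s3", "s4*"]      # only the final step was corrected
early_fix = ["s1", "s2*", "s3*", "s4*"]   # the error appeared at step 2

print(change_ratio(original, late_fix))
print(change_ratio(original, early_fix))
```

Under this assumed definition, an error early in the chain forces most later steps to be rewritten, which raises the change ratio. That matches the reading of Figure 5 that higher ratios correspond to earlier failures.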
