Surgical Post-Training: Cutting Errors, Keeping Knowledge

Wenye Lin, Kai Han

TL;DR

This paper introduces Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. It pairs this with a reward-based binary cross-entropy objective that treats reasoning correctness as a binary classification problem and enforces decoupled supervision signals.

Abstract

Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in the reward estimate of Direct Preference Optimization (DPO). This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-distribution (OOD) tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
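
The abstract contrasts DPO's relative ranking with a binary classification loss on the reward, but it does not state the objective itself. The sketch below illustrates the contrast under stated assumptions. It takes the reward to be the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and it assumes the binary cross-entropy form labels the rectified response 1 and the erroneous response 0. The function names, the default beta value, and the exact loss form are illustrative and may differ from the paper's definition and from the released code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    The reward is zero when the policy equals the reference model.
    beta = 0.1 is an illustrative default, not a value from the paper.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard DPO loss: depends only on the margin r_chosen - r_rejected."""
    return -log_sigmoid(r_chosen - r_rejected)


def reward_bce_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed reward-based BCE: each response is classified on its own.

    The chosen (rectified) response gets label 1 and the rejected
    (erroneous) response gets label 0, so the two terms are decoupled.
    """
    return -log_sigmoid(r_chosen) - log_sigmoid(-r_rejected)


if __name__ == "__main__":
    # Two reward pairs with the same margin (2.0) but different absolute levels.
    pairs = [(1.0, -1.0), (3.0, 1.0)]
    for r_c, r_r in pairs:
        print(f"r_chosen={r_c:+.1f} r_rejected={r_r:+.1f} "
              f"dpo={dpo_loss(r_c, r_r):.4f} bce={reward_bce_loss(r_c, r_r):.4f}")
```

Under these assumptions, the DPO loss is identical for both pairs because only the margin matters. The BCE form penalizes the second pair more, since its rejected response still carries a positive reward. This is one way to read "decoupled supervision signals": each response is pushed toward its own label rather than only away from the other.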

Paper Structure (43 sections, 12 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: Illustration of SPoT. Left: Our framework utilizes an Oracle to apply surgical rectifications to erroneous reasoning steps, generating positive samples that remain proximal to the model’s original distribution. Top-Right: Unlike the relative ranking used in DPO, we leverage an explicit classification loss that proves more effective for reasoning tasks. Bottom-Right: The tethering effect inherent in our reward definition effectively mitigates catastrophic forgetting.
  • Figure 2: IFEval accuracy (avg@5). SFT+ loses instruction-following ability, while reward-based methods do not.
  • Figure 3: Training loss curve. Reward-SFT and DPO converge rapidly to near-zero as the policy satisfies the relative margin constraint. SFT+ remains high as it attempts to maximize absolute likelihood, driving continuous parameter updates.
  • Figure 4: Evolution of implicit rewards during training. Left: reward scores for chosen responses; Right: reward scores for rejected responses. The reward shows how much the model's preference for the selected response has increased compared to its initial stage.
  • Figure 5: Distributions of change ratio. A higher change ratio indicates that reasoning failures occur earlier in the reasoning chain. The shape of the distribution is jointly determined by model ability and data difficulty.
  • ...and 4 more figures