SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin

Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu

TL;DR

SPPD is theoretically proven to be equivalent to on-policy policy gradient methods under reward constraints, and it demonstrates superior performance across in-domain and out-of-domain mathematical benchmarks.

Abstract

Enhancing the numerical and logical reasoning capability of Large Language Models (LLMs) has recently emerged as a research hotspot. Existing methods face several limitations: inference-phase techniques (e.g., Chain of Thought) rely on prompt selection and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle with step-wise mathematical correctness and depend on distillation from stronger models or on human annotations; and Reinforcement Learning (RL) approaches incur high GPU memory costs and unstable training. To address these limitations, we propose a Self-training framework integrating Process Preference learning using Dynamic value margin (SPPD). SPPD leverages a process-based Markov Decision Process (MDP) and the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization, and it employs tree-based self-sampling on model responses without any distillation from other models. Furthermore, we theoretically prove that SPPD is equivalent to on-policy policy gradient methods under reward constraints. Experiments on 7B-scale models demonstrate superior performance across in-domain and out-of-domain mathematical benchmarks. We open-source our code at \href{https://anonymous.4open.science/r/SSDPO-D-DCDD}{https://anonymous.4open.science/r/SPPD-DCDD}.
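To make the idea of a margin in step-level preference optimization concrete, the following is a minimal sketch and not the authors' implementation. It takes the standard DPO log-sigmoid objective for one preferred/dispreferred step pair and subtracts a margin term. The specific margin used here, `gamma * (value_w - value_l)`, is an assumption made only for illustration; the exact dynamic value margin is defined by Theorem 4.2 of the paper, which is not reproduced in this summary. The function name, parameter names, and default values are all hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def step_dpo_margin_loss(
    logp_w: float,      # log-prob of the preferred step under the policy
    logp_l: float,      # log-prob of the dispreferred step under the policy
    ref_logp_w: float,  # log-prob of the preferred step under the reference model
    ref_logp_l: float,  # log-prob of the dispreferred step under the reference model
    value_w: float,     # value estimate after the preferred step (e.g. from a PRM)
    value_l: float,     # value estimate after the dispreferred step
    beta: float = 0.1,  # hypothetical default
    gamma: float = 1.0, # hypothetical default
) -> float:
    """DPO-style loss for one step pair with an illustrative value-based margin."""
    # Implicit reward gap, as in standard DPO.
    reward_gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # ASSUMPTION: margin proportional to the value gap between the two steps.
    # The paper's actual dynamic value margin may take a different form.
    margin = gamma * (value_w - value_l)
    return -log_sigmoid(reward_gap - margin)


if __name__ == "__main__":
    # With a zero value gap the loss reduces to the plain DPO loss for this pair.
    print(step_dpo_margin_loss(-1.0, -2.0, -1.5, -1.5, 0.5, 0.5))
    # A positive value gap raises the bar the reward gap must clear.
    print(step_dpo_margin_loss(-1.0, -2.0, -1.5, -1.5, 0.8, 0.3))
```

Under this assumed form, a larger value gap between the two steps requires a larger implicit reward gap before the loss becomes small, which is the usual effect of a margin in a pairwise preference loss.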

Paper Structure

This paper contains 21 sections, 8 theorems, 26 equations, 3 figures, 5 tables.

Key Result

Lemma 4.1

Under the step MDP definition in Section sec:step-dpo-mdp and the fixed-point solution of the maximum causal entropy problem (Equation (equ:fix_solution)), the optimal step reward function can be calculated as follows:

Figures (3)

  • Figure 1: The framework of SPPD: unlike CoT and MCTS, Tree-Based Self-Sampling generates step trajectories with common prefixes and significantly preserves the output distribution of the policy. The former provides step preference signals for SPPD, while the latter theoretically ensures consistency with on-policy gradient methods, thereby enabling self-enhancement of the model's reasoning capabilities.
  • Figure 2: Impact of $\gamma$ in dynamic value margin.
  • Figure 3: Skywork-o1-Open-PRM-Qwen-2.5-7B distribution.

Theorems & Definitions (15)

  • Lemma 4.1: Optimal Step Reward Function
  • Theorem 4.2: Step DPO Loss Using Dynamic Value Margin.
  • Lemma 4.3
  • Remark
  • Definition 5.1: Preference decoding model $\pi^p_\theta$ induced by $\pi_\theta$
  • Remark
  • Lemma 5.1: Online Policy Gradient on $\pi^p_\theta$ rpg
  • Theorem 5.2: Equivalence Between Offline Step DPO and Online Policy Gradient
  • Remark
  • Lemma D.1: Optimal Step Reward Function
  • ...and 5 more