SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin

Hao Yi; Qingyang Li; Yulan Hu; Fuzheng Zhang; Di Zhang; Yong Liu

TL;DR

The paper proves theoretically that SPPD is equivalent to on-policy policy gradient methods under reward constraints, and demonstrates superior performance across in-domain and out-of-domain mathematical benchmarks.

Abstract

Recently, enhancing the numerical and logical reasoning capability of Large Language Models (LLMs) has emerged as a research hotspot. Existing methods face several limitations: inference-phase techniques (e.g., Chain-of-Thought) rely on prompt selection and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle with step-wise mathematical correctness and depend on distillation from stronger models or on human annotations; and Reinforcement Learning (RL) approaches incur high GPU memory costs and unstable training. To address these, we propose a Self-training framework integrating Process Preference learning using Dynamic value margin (SPPD). SPPD leverages a process-based Markov Decision Process (MDP) and the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization, and it employs tree-based self-sampling on model responses without any distillation from other models. Furthermore, we theoretically prove that SPPD is equivalent to on-policy policy gradient methods under reward constraints. Experiments on 7B-scale models demonstrate superior performance across in-domain and out-of-domain mathematical benchmarks. We open-source our code at \href{https://anonymous.4open.science/r/SSDPO-D-DCDD}{https://anonymous.4open.science/r/SPPD-DCDD}.
