PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Jiansheng Wei, Xiaojun Meng, Min Zhang

TL;DR

PACE challenges the scaling hypothesis of Best-of-$N$ exploration in Iterative Direct Preference Optimization for mathematical reasoning. It replaces brute-force mining with a three-phase pipeline—Proximal Exploration, Hindsight Refinement with Quality Gating, and Contrastive Pair Construction—operating with a minimal budget of $N=2$ to synthesize high-information, proximal training pairs. Theoretical analysis shows that high-$N$ sampling amplifies verifier noise and induces distributional shift, leading to instability, while PACE maintains robust learning and resilience to label noise. Empirically, PACE matches or surpasses DPO-R1 with $N=16$ at roughly $1/5$ the compute and demonstrates strong data efficiency and robustness across multiple mathematical benchmarks and models.
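The three phases named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `sample`, `verify`, and `refine`, the `PreferencePair` container, and the choice to condition refinement on the reference answer are all hypothetical, and the toy helpers exist only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # y_fix: a correction that passed the quality gate
    rejected: str  # y_err: the failed attempt it was derived from


def build_pairs(
    prompt: str,
    reference: str,
    sample: Callable[[str], str],                         # hypothetical policy sampler
    verify: Callable[[str, str], bool],                   # hypothetical answer verifier
    refine: Callable[[str, str, str], Optional[str]],     # hypothetical corrector
    n: int = 2,
) -> List[PreferencePair]:
    """Assumed shape of the pipeline: explore, refine failures, build pairs."""
    # Phase I: proximal exploration with a small sampling budget.
    attempts = [sample(prompt) for _ in range(n)]
    pairs: List[PreferencePair] = []
    for y_err in attempts:
        if verify(y_err, reference):
            continue  # only failed attempts are refined in this sketch
        # Phase II: hindsight refinement (assumed to see the reference answer).
        y_fix = refine(prompt, y_err, reference)
        # Quality gate: drop corrections that are missing or do not verify.
        if y_fix is None or not verify(y_fix, reference):
            continue
        # Phase III: contrastive pair, correction preferred over its failure.
        pairs.append(PreferencePair(prompt=prompt, chosen=y_fix, rejected=y_err))
    return pairs


def toy_verify(solution: str, reference: str) -> bool:
    """Toy verifier: checks only the text after the last '='."""
    return solution.rsplit("=", 1)[-1].strip() == reference


def toy_refine(prompt: str, y_err: str, reference: str) -> Optional[str]:
    """Toy corrector: rewrites the final answer of a failed attempt."""
    if "=" not in y_err:
        return None
    return y_err.rsplit("=", 1)[0] + "=" + reference
```

Because each chosen response is derived from its own rejected response, the two stay close to each other and to the current policy, which is one plausible reading of "proximal" here. A real gate would presumably check the reasoning and not only the final answer.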

Abstract

Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-$N$ sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2<N<3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 $(N=16)$ while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.

Paper Structure (37 sections, 1 theorem, 34 equations, 8 figures, 5 tables, 1 algorithm)

Key Result

Proposition 1.1

Consider a marginal task where a verified solution is obtained for the first time only at the $N$-th attempt (i.e., $D_N = \{1, N\}$). The expected false positive (FP) rate of such a sample, denoted by $\bar{\Psi}(N) = \mathbb{E}_{\alpha \mid D_N} [\Psi(\alpha)]$, is strictly monotonically increasing ...
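The intuition behind this monotonicity can be checked numerically in a toy model. Everything in the sketch below is an assumption made for illustration and is not taken from the paper: $\alpha$ is read as the per-sample probability of a truly correct solution with a uniform prior, the verifier accepts every correct solution and accepts an incorrect one with a fixed rate `eps`, and $\Psi(\alpha)$ is read as the probability that an accepted sample is actually incorrect. Under these assumptions the posterior mean has the closed form $\varepsilon N / (1 + \varepsilon N)$, which the code compares against direct numerical integration.

```python
def posterior_fp_rate(n: int, eps: float, grid: int = 20000) -> float:
    """Toy estimate of E[Psi(alpha)] given the first accepted sample is attempt n.

    Assumptions (hypothetical, for illustration only): uniform prior on alpha,
    perfect recall on correct solutions, false-accept rate eps on incorrect ones.
    """
    num = 0.0
    den = 0.0
    for i in range(grid):
        alpha = (i + 0.5) / grid                 # midpoint rule on (0, 1)
        p_pass = alpha + (1.0 - alpha) * eps     # chance one sample is accepted
        # Likelihood of n - 1 rejections followed by one acceptance.
        weight = (1.0 - p_pass) ** (n - 1) * p_pass
        # Chance that an accepted sample is a false positive.
        psi = (1.0 - alpha) * eps / p_pass
        num += psi * weight
        den += weight
    return num / den


def closed_form_fp_rate(n: int, eps: float) -> float:
    """Closed form of the same toy posterior mean: eps * n / (1 + eps * n)."""
    return eps * n / (1.0 + eps * n)
```

With `eps = 0.05`, the closed form gives about 0.09 at $N=2$ and about 0.44 at $N=16$. These are properties of the toy model, not results reported by the paper, but they show the mechanism: the later the first accepted sample arrives, the more the posterior favors a low true success rate, and the more likely the accepted sample is a verifier error.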

Figures (8)

  • Figure 1: Overview of PACE vs. Standard Best-of-N DPO. (A) Standard BoN ($N=16$): Relies on high-compute sampling to mine positive signals, risking reward hacking where models match labels through flawed logic. (B) PACE (Ours): A three-phase pipeline: (I) Proximal Exploration ($N=2$) to minimize compute; (II) Hindsight Refinement with logical gating to synthesize verified corrections ($y_{fix}$) from failure traces ($y_{err}$); (III) Contrastive Construction of high-density pairs. PACE achieves superior reasoning alignment with $6\times$ lower overhead and higher resistance to label noise by prioritizing logical density over search breadth.
  • Figure 2: The Dynamics of Iterative Alignment on Llama-3.1-8B.
  • Figure 3: Data Topology Analysis: Hard vs. Easy Negatives.
  • Figure 4: Comparison between generation prompts (Llama3 and Qwen).
  • Figure 5: Reflection Prompts. The only difference between the models is the special tokens they use.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Proposition 1.1: Monotonicity of Marginal False Positives
  • Proof