Bootstrapping LLMs via Preference-Based Policy Optimization

Chen Jia

TL;DR

PbPO aligns LLMs with human preferences through online bootstrapping: a min-max interaction between a policy and a reward model that is constrained to a confidence set derived from preference data. It unifies reward-agnostic exploration with reward-aware exploitation, provides theoretical regret guarantees for both sequence-level and token-level reward models, and reports empirical gains over state-of-the-art baselines on five benchmarks. The approach uses a Stackelberg-style relaxation and gradient-based adversarial training to make online RLHF-style refinement practical and scalable. The results suggest that iterative, preference-informed bootstrapping can yield robust alignment with human preferences while mitigating reward misspecification and overfitting. The framework offers a principled path for continuously improving LLM behavior with less reliance on static annotated data.

Abstract

Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotations. In this work, we propose a novel preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration of the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing high-probability regret bounds for both settings with sequence-level RM and token-level RM, demonstrating its effectiveness in bootstrapping LLMs. Extensive experiments on five benchmarks show that our approach consistently outperforms existing state-of-the-art preference optimization techniques.

Paper Structure

This paper contains 24 sections, 18 theorems, 113 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

For any $\delta \in (0,1]$, let $\zeta_k = \mathcal{O}(d\log(Bk/\delta))$ for every $k \in \{1,2,\ldots,K\}$, where $K$ is the maximum number of episodes. Then, under Assumptions asp:linearboundseq and asp:reaseq, setting $\gamma = \sqrt{\log\left( 1 + 4K/(c_2d^2 \log(K/\delta)) \right)}$, we have with probability … where $c_1, c_2 > 0$ denote universal constants.
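To get a feel for the magnitude of $\gamma$ in Theorem 1, the setting above can be evaluated numerically. The sketch below only transcribes the stated formula; the function name `gamma` is hypothetical, and because the theorem leaves the universal constant $c_2$ unspecified, the default `c2=1.0` is a placeholder assumption rather than a value from the paper.

```python
import math


def gamma(K: int, d: int, delta: float, c2: float = 1.0) -> float:
    """Evaluate gamma = sqrt(log(1 + 4K / (c2 * d^2 * log(K / delta)))).

    K     : maximum number of episodes
    d     : feature dimension
    delta : failure probability, in (0, 1]
    c2    : universal constant (unspecified in the source; 1.0 is a placeholder)
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    if K < 1 or d < 1 or c2 <= 0.0:
        raise ValueError("K and d must be >= 1 and c2 must be positive")
    log_term = math.log(K / delta)
    if log_term <= 0.0:
        # Happens only when K = 1 and delta = 1, where the formula divides by zero.
        raise ValueError("need K / delta > 1 for the formula to be defined")
    return math.sqrt(math.log(1.0 + 4.0 * K / (c2 * d**2 * log_term)))


g_small = gamma(K=1000, d=10, delta=0.05)
g_large = gamma(K=2000, d=10, delta=0.05)
print(f"gamma(K=1000) = {g_small:.4f}, gamma(K=2000) = {g_large:.4f}")
```

With $d$ and $\delta$ held fixed, $\gamma$ grows slowly with $K$, since $K$ enters only through a logarithm of a ratio that itself contains $\log(K/\delta)$ in the denominator.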

Figures (2)

  • Figure 1: PbPO framework for bootstrapping LLMs. At each episode $k \in \{1,2,\ldots,K\}$: (I) Reward-agnostic exploration: collect new preference data using the current reference policy $\pi^k_{\rm ref}$ and an exploration-enhancing policy $\hat{\pi}^k$. (II) Reward-aware exploration & exploitation: update the main LLM policy $\pi^k$ via a min-max objective using the reward model trained on collected preferences.
  • Figure 2: Experimental analysis based on LLaMA2-7B backbone.
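Phase (II) of Figure 1, a min-max update against a reward model restricted to a confidence set, can be illustrated on a toy problem. The following is a minimal sketch under several assumptions that are not taken from the paper: rewards are linear in hand-picked 2-d features, preferences follow a Bradley-Terry model, the confidence set is a log-likelihood-ratio set over a small grid of parameters with radius `zeta`, and the "policy" simply picks one of three fixed responses. All names (`FEATURES`, `confidence_set`, `maxmin_response`) are hypothetical.

```python
import math
from itertools import product

# Hypothetical toy setup: three candidate responses with 2-d feature vectors.
FEATURES = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (0.5, 0.5)}


def reward(theta, response):
    """Linear reward <theta, phi(response)> (an assumed parameterization)."""
    return sum(t * f for t, f in zip(theta, FEATURES[response]))


def log_likelihood(theta, prefs):
    """Bradley-Terry log-likelihood of (winner, loser) pairs (assumed model)."""
    total = 0.0
    for winner, loser in prefs:
        margin = reward(theta, winner) - reward(theta, loser)
        total -= math.log1p(math.exp(-margin))  # log sigmoid(margin)
    return total


def confidence_set(candidates, prefs, zeta):
    """Keep every theta whose log-likelihood is within zeta of the best one."""
    scores = {theta: log_likelihood(theta, prefs) for theta in candidates}
    best = max(scores.values())
    return [theta for theta, score in scores.items() if score >= best - zeta]


def maxmin_response(conf_set, reference):
    """Pick the response with the largest worst-case reward gap over the reference."""
    def worst_case_gap(response):
        return min(reward(th, response) - reward(th, reference) for th in conf_set)
    return max(FEATURES, key=worst_case_gap)


# Candidate reward parameters: a 3x3 grid over [-1, 1]^2.
candidates = list(product([-1.0, 0.0, 1.0], repeat=2))
# Collected preferences as (winner, loser) pairs.
prefs = [("a", "b"), ("a", "c"), ("c", "b")]

conf = confidence_set(candidates, prefs, zeta=1.0)
choice = maxmin_response(conf, reference="b")
print("confidence set:", sorted(conf))
print("max-min response:", choice)
```

The inner `min` plays the role of the adversarial reward model and the outer `max` the policy; shrinking `zeta` tightens the set of reward models the policy has to hedge against. In an LLM setting both players would be neural networks trained by gradients rather than enumerated, so this enumeration is only meant to show the structure of the objective.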

Theorems & Definitions (34)

  • Theorem 1: Regret bound with sequence-level RM
  • Remark 1
  • Corollary 1: Sample complexity with sequence-level RM
  • Theorem 2: Regret lower bound with sequence-level RM
  • Theorem 3: Regret bound with token-level RM
  • Remark 2
  • Corollary 2: Sample complexity with token-level RM
  • Theorem 4: Regret lower bound with token-level RM
  • Proposition 1: Confidence
  • proof
  • ...and 24 more