Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

Yining Li, Peizhong Ju, Ness Shroff

TL;DR

This work proposes a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods, and introduces an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics.

Abstract

Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.

Paper Structure (32 sections, 13 theorems, 160 equations, 3 figures, 3 algorithms)

Key Result

Theorem 3.4

Under assump:feasibility, assump:bounded, and assump:reference_policy_full_support, and with suitably chosen hyper-parameters $\eta_{\theta}$ and $\eta_{\lambda}$ (e.g., $\eta_{\theta}=\eta_{\lambda}=3\sqrt{|\mathcal{H}|}R_{\max}$), the optimistic primal-dual iterates of eq:pi_t, eq:lambda_t, eq:hatp where $0<\rho<1$ is defined in eq:rho_valued and $\Phi_1$ is a constant defined in eq:Phi_1_valued.

Figures (3)

  • Figure 1: Comparison of OPD and PD under a softmax tabular parameterization in a single-state, two-action RLHF toy problem. OPD (red) converges to the optimal solution in the last iterate, while PD (blue) exhibits persistent oscillations and fails to converge.
  • Figure 2: Comparison of PD and OPD on reward and constrained reward during the training phase.
  • Figure 3: Inference comparison of PD and OPD on reward and cost.
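
The PD-versus-OPD contrast described for Figure 1 can be illustrated on a much simpler problem than the paper's RLHF setting. The sketch below is not the paper's algorithm or its toy problem: it assumes the textbook bilinear saddle point f(x, y) = x * y, and the function names, step size, and step count are all hypothetical choices made for illustration. Plain gradient descent-ascent spirals away from the saddle point at the origin, while the optimistic variant, which extrapolates with the previous gradient, contracts toward it in the last iterate.

```python
import math


def grad(x, y):
    # Gradients of the assumed objective f(x, y) = x * y:
    # df/dx = y, df/dy = x.
    return y, x


def run_gda(x, y, eta, steps):
    """Plain gradient descent (on x) / ascent (on y); returns final distance to (0, 0)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Simultaneous update: both right-hand sides use the old (x, y).
        x, y = x - eta * gx, y + eta * gy
    return math.hypot(x, y)


def run_ogda(x, y, eta, steps):
    """Optimistic variant: each step uses the extrapolated gradient 2*g_t - g_{t-1}."""
    # Assumption: the "previous" gradient is initialised to the current one,
    # so the first step coincides with a plain descent-ascent step.
    prev_gx, prev_gy = grad(x, y)
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = (
            x - eta * (2.0 * gx - prev_gx),
            y + eta * (2.0 * gy - prev_gy),
        )
        prev_gx, prev_gy = gx, gy
    return math.hypot(x, y)


if __name__ == "__main__":
    start = (1.0, 1.0)
    print("plain      :", run_gda(*start, eta=0.1, steps=2000))
    print("optimistic :", run_ogda(*start, eta=0.1, steps=2000))
```

With these assumed settings the plain iterates end far from the origin and the optimistic iterates end close to it. This only mirrors the qualitative behaviour reported for PD and OPD; the paper's guarantees concern the constrained RLHF objective, not this bilinear example.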

Theorems & Definitions (23)

  • Theorem 3.4
  • Remark 3.5: Equivalence between Distribution-Space OPD and NPG Updates
  • Remark 3.6: Relationship to PPO in Practice
  • Corollary 3.10
  • Lemma B.1: Hölder's inequality
  • Lemma B.2: Pinsker's inequality (discrete form)
  • Lemma B.3: Young's inequality
  • Lemma B.4
  • proof
  • Lemma B.5
  • ...and 13 more