Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

Yining Li; Peizhong Ju; Ness Shroff

TL;DR

This work proposes a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods, and introduces an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics.

Abstract

Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.
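
The last-iterate instability of standard primal-dual updates, and the stabilizing effect of optimism, can be seen on the textbook bilinear saddle-point problem $\min_x \max_y xy$. The sketch below is only an illustration on that toy problem, not the paper's OPD algorithm: the function name `run`, the objective $f(x,y)=xy$, the starting point, the step size `eta`, and the number of steps are all assumptions chosen for the demonstration. The optimistic variant uses the common "twice the current gradient minus the previous gradient" form of the predictive step.

```python
import math


def run(optimistic, steps=2000, eta=0.1):
    """Iterate on f(x, y) = x * y (min over x, max over y).

    Returns the distance of the last iterate from the saddle point (0, 0).
    Hypothetical helper for illustration only.
    """
    x, y = 1.0, 1.0
    # Previous gradients; only the optimistic variant uses them.
    gx_prev, gy_prev = y, x
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y, df/dy = x
        if optimistic:
            # Predictive step: extrapolate with 2 * current - previous gradient.
            dx, dy = 2 * gx - gx_prev, 2 * gy - gy_prev
        else:
            # Plain simultaneous gradient descent-ascent.
            dx, dy = gx, gy
        x, y = x - eta * dx, y + eta * dy  # descend in x, ascend in y
        gx_prev, gy_prev = gx, gy
    return math.hypot(x, y)


if __name__ == "__main__":
    print("plain descent-ascent, last-iterate distance:", run(optimistic=False))
    print("optimistic variant,   last-iterate distance:", run(optimistic=True))
```

With these settings the plain update spirals away from the saddle point, because each step multiplies the squared distance by $1+\eta^2$, while the optimistic update contracts toward it. This mirrors, in the simplest possible setting, the oscillation-damping role the abstract attributes to optimism.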

Paper Structure (32 sections, 13 theorems, 160 equations, 3 figures, 3 algorithms)

Introduction
Preliminaries on Constrained RLHF
Constrained RLHF Problem
Lagrangian Method
A universal safe RLHF framework
Optimistic Primal-Dual Method
Example: Failure of Last-Iterate Convergence in a Bilinear Saddle-Point Problem
OPD in Distribution Space
OPD in Parameter Space
OPD Updates in the Parameterized Policy Space
A Toy RLHF Example Illustrating the Stability of OPD
Theoretical Results
Computational Experiments
Datasets and Reward Models
OPD implementation
...and 17 more sections

Key Result

Theorem 3.4

Under assump:feasibility, assump:bounded, and assump:reference_policy_full_support, and with suitably chosen hyper-parameters $\eta_{\theta}$ and $\eta_{\lambda}$ (e.g., $\eta_{\theta}=\eta_{\lambda}=3\sqrt{|\mathcal{H}|}R_{\max}$), the optimistic primal-dual iterates of eq:pi_t, eq:lambda_t, and eq:hatp satisfy a last-iterate convergence bound, where $0<\rho<1$ is defined in eq:rho_valued and $\Phi_1$ is a constant defined in eq:Phi_1_valued.

Figures (3)

Figure 1: Comparison of OPD and PD under a softmax tabular parameterization in a single-state, two-action RLHF toy problem. OPD (red) converges to the optimal solution in the last iterate, while PD (blue) exhibits persistent oscillations and fails to converge.
Figure 2: Comparison of PD and OPD on reward and constrained reward during the training phase.
Figure 3: Inference comparison of PD and OPD on reward and cost.

Theorems & Definitions (23)

Theorem 3.4
Remark 3.5: Equivalence between Distribution-Space OPD and NPG Updates
Remark 3.6: Relationship to PPO in Practice
Corollary 3.10
Lemma B.1: Hölder's inequality
Lemma B.2: Pinsker's inequality (discrete form)
Lemma B.3: Young's inequality
Lemma B.4
proof
Lemma B.5
...and 13 more
