Primal-Dual Direct Preference Optimization for Constrained LLM Alignment

Yihan Du, Seo Taek Kong, R. Srikant

TL;DR

The paper tackles constrained alignment of LLMs: maximizing reward while keeping the cost of potentially unsafe content below a threshold. It introduces PD-DPO, which first trains a model with standard DPO on reward preference data and then fine-tunes on cost preference data with a rearranged Lagrangian DPO objective that uses the reward information from the first model. This avoids training explicit reward and cost models and requires no prior knowledge of the optimal solution. The authors prove suboptimality and constraint-violation guarantees and extend the approach to an online setting with exploration bonuses, which removes the dependence on preference data coverage. On PKU-SafeRLHF, PD-DPO achieves the best helpfulness among the compared models while reducing harmful content relative to the SFT model, at lower memory and computational cost than methods that train reward and cost models.
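
For context, the standard DPO step that the first stage builds on can be sketched as follows. This is a minimal illustration of the widely known DPO loss for a single preference pair, not the authors' implementation; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All inputs are sequence log-probabilities: the first two under the
    policy being trained, the last two under the frozen reference model.
    Names and the default beta are illustrative assumptions.
    """
    # Implicit reward margin: beta times the gap between the policy/reference
    # log-ratio of the chosen response and that of the rejected response.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.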

Abstract

The widespread application of Large Language Models (LLMs) imposes increasing demands on safety, such as reducing harmful content and fake information, and avoiding certain forbidden tokens due to rules and laws. While several recent works have studied safe alignment of LLMs, these works either require training reward and cost models, which incurs high memory and computational costs, or need prior knowledge about the optimal solution. Motivated by this fact, we study the problem of constrained alignment in LLMs, i.e., maximizing the output reward while restricting the cost due to potentially unsafe content to stay below a threshold. For this problem, we propose a novel primal-dual DPO approach, which first trains a model using standard DPO on reward preference data to provide reward information, and then adopts a rearranged Lagrangian DPO objective that uses the provided reward information to fine-tune LLMs on cost preference data. Our approach significantly reduces memory and computational costs, and does not require extra prior knowledge. Moreover, we establish rigorous theoretical guarantees on the suboptimality and constraint violation of the output policy. We also extend our approach to an online data setting by incorporating exploration bonuses, which enable our approach to explore uncovered prompt-response space, and then provide theoretical results that remove the dependence on preference data coverage. Experimental results on the widely used preference dataset PKU-SafeRLHF demonstrate the effectiveness of our approach.
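
The primal-dual idea behind the cost constraint can be illustrated with a generic projected dual-ascent update on a Lagrange multiplier. The sketch assumes the constrained problem is relaxed as expected reward minus the multiplier times (expected cost minus threshold). The paper's exact update rule is not stated in this summary, so this is a textbook stand-in rather than the authors' algorithm, and `dual_ascent_step`, `step_size`, and `lam_max` are hypothetical names.

```python
def dual_ascent_step(lam, estimated_cost, threshold, step_size, lam_max=None):
    """One projected dual-ascent step on the Lagrange multiplier.

    Generic primal-dual update (an assumption, not the paper's exact rule):
    the multiplier grows when the estimated cost exceeds the threshold and
    shrinks otherwise, and is projected back onto [0, lam_max].
    """
    # Positive when the constraint "cost <= threshold" is violated.
    violation = estimated_cost - threshold
    # Gradient step on the dual variable, then projection onto lam >= 0.
    new_lam = max(0.0, lam + step_size * violation)
    # Optional upper bound keeps the multiplier in a bounded range.
    if lam_max is not None:
        new_lam = min(new_lam, lam_max)
    return new_lam


# Hypothetical usage: the multiplier rises while the policy is too costly.
lam = 0.0
for cost in [0.9, 0.7, 0.4, 0.1]:
    lam = dual_ascent_step(lam, cost, threshold=0.3, step_size=0.5, lam_max=10.0)
```

A larger multiplier puts more weight on the cost term in the next policy update, which is how the primal (policy) and dual (multiplier) steps interact.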

Paper Structure

This paper contains 23 sections, 20 theorems, 144 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 1

With probability at least $1-\delta$, for any $K\geq1$, the output policy $\pi^{\textup{out}}_K$ of algorithm $\mathtt{PD\hbox{-}DPO}$ satisfies

Figures (1)

  • Figure 1: (Left) Rewards and negated costs of responses generated by the compared language models, as evaluated by Beaver-7b-unified-reward and Beaver-7b-unified-cost. (Right) Helpfulness and harmlessness Elo scores evaluated by GPT-4. Our $\mathtt{PD\hbox{-}DPO}$ model achieves the best helpfulness performance while reducing harmful content generation compared to the SFT model.

Theorems & Definitions (39)

  • Theorem 1: Result of Algorithm $\mathtt{PD\hbox{-}DPO}$
  • Theorem 2: Result of Algorithm $\mathtt{O\hbox{-}PD\hbox{-}DPO}$
  • Theorem 3: Connection between Standard DPO and Standard RLHF with Constrained Policy Ranges
  • proof
  • Theorem 4: Connection between Our Rearranged Lagrangian DPO and Safe RLHF
  • proof: Proof of Theorem 4
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 29 more