Enhancing LLM Safety via Constrained Direct Preference Optimization

Zixuan Liu; Xiaolin Sun; Zizhan Zheng

TL;DR

The paper tackles the challenge of aligning LLMs with diverse human preferences while ensuring safety. It decouples reward (helpfulness) from cost (harmlessness) and avoids costly reinforcement learning. It introduces Constrained Direct Preference Optimization (C-DPO), an RL-free fine-tuning method that combines dual gradient descent with Direct Preference Optimization (DPO) to maximize expected reward subject to a safety constraint. The method relies on a Lagrangian formulation and a new Bradley-Terry (BT) preference model $p^*_{\lambda}$ derived from the combined reward $r_{\lambda}=r-\lambda c$. The approach provides a safety guarantee and, empirically on Llama-2-7B, achieves higher rewards than Safe RLHF baselines under the same safety constraint, with the dual variable $\lambda$ controlling the trade-off between helpfulness and harmlessness. The results suggest that C-DPO is a scalable, hardware-efficient alternative to RL-based Safe RLHF for deploying safer, higher-quality LLMs.
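To make the role of $r_{\lambda}=r-\lambda c$ concrete, the following is a minimal sketch of a Bradley-Terry preference probability computed from the combined reward. The function names and the numeric scores are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
import math

def combined_reward(r: float, c: float, lam: float) -> float:
    """Combined reward r_lambda = r - lambda * c."""
    return r - lam * c

def bt_preference(r_a: float, c_a: float, r_b: float, c_b: float, lam: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B
    when both are scored by the combined reward r_lambda."""
    diff = combined_reward(r_a, c_a, lam) - combined_reward(r_b, c_b, lam)
    return 1.0 / (1.0 + math.exp(-diff))  # logistic sigmoid of the reward gap

# Hypothetical scores: A is more helpful (higher r) but also more harmful (higher c).
p_lam0 = bt_preference(r_a=2.0, c_a=3.0, r_b=1.0, c_b=0.0, lam=0.0)
p_lam1 = bt_preference(r_a=2.0, c_a=3.0, r_b=1.0, c_b=0.0, lam=1.0)
print(f"lambda=0: P(A preferred) = {p_lam0:.3f}")
print(f"lambda=1: P(A preferred) = {p_lam1:.3f}")
```

With $\lambda=0$ the preference follows helpfulness alone, so A is favored; once $\lambda$ is large enough, the cost term dominates and the preference flips toward the safer response B.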

Abstract

The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computationally expensive and often unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs that is both efficient and lightweight. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint compared to a recently proposed safe RLHF approach. Warning: This paper contains example data that may be offensive or harmful.
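The interplay between the dual variable and the policy can be sketched on a toy problem. This is an illustration under strong simplifying assumptions, not the authors' implementation: the "policy" is a distribution over three fixed responses with a uniform reference, and the inner DPO step is replaced by the standard closed-form maximizer of a KL-regularized objective. All names, scores, the step size, and the cost limit are hypothetical.

```python
import math

def optimal_policy(rewards, costs, lam, beta=1.0):
    """Closed-form maximizer of the KL-regularized objective with a uniform
    reference policy: pi(y) is proportional to exp((r(y) - lam * c(y)) / beta).
    Used here as a stand-in for a DPO run on preferences induced by r_lambda."""
    logits = [(r - lam * c) / beta for r, c in zip(rewards, costs)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def expected(values, policy):
    """Expectation of per-response values under the policy."""
    return sum(v * p for v, p in zip(values, policy))

def dual_gradient_descent(rewards, costs, c_limit, step=0.5, iters=200):
    """Alternate between solving for the policy at fixed lambda and taking a
    projected gradient step on lambda."""
    lam = 0.0
    for _ in range(iters):
        pi = optimal_policy(rewards, costs, lam)
        violation = expected(costs, pi) - c_limit  # positive when the constraint is violated
        lam = max(0.0, lam + step * violation)     # projection keeps lambda >= 0
    return lam, optimal_policy(rewards, costs, lam)

# Hypothetical scores: the most helpful response is also the most costly.
rewards = [3.0, 2.0, 1.0]
costs = [2.0, 1.0, 0.0]
lam, pi = dual_gradient_descent(rewards, costs, c_limit=0.5)
print(f"lambda = {lam:.3f}, expected cost = {expected(costs, pi):.3f}, "
      f"expected reward = {expected(rewards, pi):.3f}")
```

In this toy setting, $\lambda$ rises while the expected cost exceeds the limit and settles where the constraint is met with equality, which is the trade-off mechanism the abstract describes.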

Paper Structure (24 sections, 2 theorems, 25 equations, 3 figures, 12 tables)

Introduction
Preliminaries
Reinforcement Learning From Human Feedback (RLHF)
Safe RLHF
Method
Experiments
Appendix
Discussion
Analytical Results
Strong Duality of Safe RLHF
Deriving the Optimum to the Unconstrained Objective
Equivalence of safe RLHF and Maximum Likelihood Objective
Deriving the gradient of dual function
Related Work
Details About the Constrained DPO (C-DPO) Algorithm
...and 9 more sections

Key Result

Proposition 1

(Strong duality of the Safe RLHF problem). Let $\pi^{*}_{\theta}$ be the optimal primal variable such that $J_{r}(\pi^{*}_{\theta})=\underset{\pi_{\theta}}{\max}\{J_{r}(\pi_{\theta}) \mid J_{c}(\pi_{\theta}) \leq C_{limit}\}$. Let $\lambda^{*}$ be the optimal dual variable, $\lambda^{*} = \underset{\lambda \geq 0}{\arg\min}\,\underset{\pi_{\theta}}{\max}\, J(\pi_{\theta}, \lambda)$, where $J(\pi_{\theta}, \lambda)=J_{r}(\pi_{\theta})-\lambda J_{c}(\pi_{\theta})$. Then strong duality holds for this primal-dual pair.

Figures (3)

Figure 1: Constrained DPO (C-DPO) method compared to DPO. Our method extends DPO to address the dual-objective alignment problem that jointly considers helpfulness and harmlessness, which cannot be directly solved by the original DPO. In particular, we introduce a new preference dataset $D_{r_{\lambda}}$ for each $\lambda$ and leverage the dual gradient descent technique to identify a nearly optimal policy.
Figure 2: Training curves for the Lagrange dual variable $\lambda$, the expected cost, and the expected reward during C-DPO training, using 1000 prompts to evaluate the expected constraint violation.
Figure 3: Scatter plots of the reward and cost distributions of different LLMs on the test datasets, where the X-axis denotes the cost and the Y-axis the reward. All models are evaluated with the open-source BEAVERTAILS preference models.

Theorems & Definitions (4)

Proposition 1
proof
Proposition 2
proof
