Enhancing LLM Safety via Constrained Direct Preference Optimization

Zixuan Liu, Xiaolin Sun, Zizhan Zheng

TL;DR

The paper tackles the challenge of aligning LLMs to diverse human preferences while ensuring safety by decoupling reward (helpfulness) from safety (harmlessness) and avoiding costly RL. It introduces Constrained Direct Preference Optimization (C-DPO), an RL-free fine-tuning method that combines dual gradient descent with Direct Preference Optimization (DPO) to maximize expected reward subject to a safety constraint, using a Lagrangian formulation and a new Bradley-Terry (BT) preference model $p^*_{\lambda}$ derived from the combined reward $r_{\lambda}=r-\lambda c$, where $r$ is the reward, $c$ is the cost, and $\lambda$ is the dual variable. The approach provides a safety guarantee and, empirically on Llama-2-7B, achieves higher rewards than Safe RLHF baselines under the same safety constraint, with the dual variable $\lambda$ controlling the trade-off between helpfulness and harmlessness. The results suggest that C-DPO is a scalable, hardware-efficient alternative to reinforcement learning-based safe RLHF, offering practical impact for deploying safer, higher-quality LLMs.
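To make the role of $r_{\lambda}=r-\lambda c$ concrete, the following is a minimal sketch of a Bradley-Terry preference probability computed from the combined reward. The function names and the scalar reward and cost inputs are assumptions made for illustration; they are not taken from the paper's code, and how the paper scores or samples responses is not specified here.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def combined_reward(r: float, c: float, lam: float) -> float:
    """r_lambda = r - lambda * c (hypothetical helper)."""
    return r - lam * c


def bt_preference(r1: float, c1: float, r2: float, c2: float, lam: float) -> float:
    """Bradley-Terry probability that response 1 is preferred to response 2
    when both are scored by the combined reward r_lambda."""
    diff = combined_reward(r1, c1, lam) - combined_reward(r2, c2, lam)
    return sigmoid(diff)
```

With $\lambda=0$ the preference depends on reward alone; as $\lambda$ grows, a high-reward but high-cost response becomes less likely to be preferred, which is the trade-off the dual variable controls.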

Abstract

The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computationally expensive and often unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs that is both efficient and lightweight. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint compared to a recently proposed safe RLHF approach. Warning: This paper contains example data that may be offensive or harmful.
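As a reading aid, the constrained problem and its standard Lagrangian relaxation can be written as follows. This is a sketch of the textbook form, using the symbols $J_r$, $J_c$, and $C_{limit}$ that appear in Proposition 1; the exact formulation in the paper may differ.

$$
\max_{\pi_{\theta}} \; J_{r}(\pi_{\theta}) \quad \text{s.t.} \quad J_{c}(\pi_{\theta}) \leq C_{limit}
\qquad\Longrightarrow\qquad
\min_{\lambda \geq 0} \; \max_{\pi_{\theta}} \; J_{r}(\pi_{\theta}) - \lambda \left( J_{c}(\pi_{\theta}) - C_{limit} \right)
$$

For a fixed $\lambda$, the term $\lambda C_{limit}$ is a constant, so the inner maximization over $\pi_{\theta}$ is equivalent to maximizing $J_{r}(\pi_{\theta}) - \lambda J_{c}(\pi_{\theta})$. Dual gradient descent alternates between this inner policy update and a gradient step on $\lambda$.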

Paper Structure

This paper contains 24 sections, 2 theorems, 25 equations, 3 figures, 12 tables.

Key Result

Proposition 1

(Strong duality of the safe RLHF objective). Let $\pi^{*}_{\theta}$ be the optimal primal variable such that $J_{r}(\pi^{*}_{\theta})=\max_{\pi_{\theta}}\{J_{r}(\pi_{\theta}) \mid J_{c}(\pi_{\theta}) \leq C_{limit}\}$. Let $\lambda^{*}$ be the optimal dual variable where $\lambda^{*} = \ldots$ … where $J(\pi_{\theta}, \lambda^{*})=J_{r}(\pi_{\theta})-\lambda^{*} J_{c}(\pi_{\theta})$. That is, …

Figures (3)

  • Figure 1: Constrained DPO (C-DPO) method compared to DPO. Our method extends DPO to address the dual-objective alignment problem that jointly considers helpfulness and harmlessness, which cannot be directly solved by the original DPO. In particular, we introduce a new preference dataset $D_{r_{\lambda}}$ for each $\lambda$ and leverage the dual gradient descent technique to identify a nearly optimal policy.
  • Figure 2: Training curves for the Lagrange dual variable $\lambda$, the expected cost, and the expected reward during C-DPO training, when 1000 prompts are used to evaluate the expected constraint violation.
  • Figure 3: Scatter plots of the reward and cost of different LLMs on the test datasets; the x-axis is the cost and the y-axis is the reward. All models are evaluated by the open-source BEAVERTAILS preference models.
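The $\lambda$ curve in Figure 2 comes from a dual update driven by the estimated constraint violation. A minimal sketch of one such step is shown below, assuming a standard projected gradient update on $\lambda$; the function names, step size, cost limit, and sample costs are hypothetical and do not come from the paper.

```python
from typing import Sequence


def mean_cost(costs: Sequence[float]) -> float:
    """Monte Carlo estimate of the expected cost from sampled responses."""
    return sum(costs) / len(costs)


def dual_step(lam: float, expected_cost: float, c_limit: float, step_size: float) -> float:
    """One projected dual update (assumed form): raise lambda when the
    estimated cost exceeds the limit, lower it otherwise, and clip at zero."""
    return max(0.0, lam + step_size * (expected_cost - c_limit))


if __name__ == "__main__":
    lam = 0.4
    # Hypothetical per-prompt costs; the paper evaluates on 1000 prompts.
    sampled_costs = [1.5, -0.5, 2.0, 1.0]
    lam = dual_step(lam, mean_cost(sampled_costs), c_limit=0.0, step_size=0.1)
    print(lam)
```

In a full loop, each new value of $\lambda$ would define a new combined reward $r_{\lambda}$ and preference dataset $D_{r_{\lambda}}$, on which the policy is fine-tuned with DPO before the next dual step.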

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof