Enhancing LLM Safety via Constrained Direct Preference Optimization
Zixuan Liu, Xiaolin Sun, Zizhan Zheng
TL;DR
The paper addresses how to align LLMs with diverse human preferences while keeping them safe. It treats reward (helpfulness) and safety (harmlessness) as separate signals and avoids costly RL. It introduces Constrained Direct Preference Optimization (C-DPO), an RL-free fine-tuning method that combines dual gradient descent with Direct Preference Optimization to maximize expected reward subject to a safety constraint. The method rests on a Lagrangian formulation and a new Bradley-Terry (BT) preference model $p^*_{\lambda}$ derived from the combined reward $r_{\lambda}=r-\lambda c$, where $c$ is the safety cost. The approach provides a safety guarantee and, empirically on Llama-2-7B, achieves higher rewards under the same safety constraint than Safe RLHF baselines; the dual variable $\lambda$ controls the trade-off between helpfulness and harmlessness. The results suggest that C-DPO is a scalable, hardware-efficient alternative to RL-based Safe RLHF, with practical value for deploying safer, higher-quality LLMs.
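To make the roles of $r_{\lambda}$, $p^*_{\lambda}$, and $\lambda$ concrete, here is a minimal sketch, not the authors' implementation. It assumes scalar reward and cost scores are already available for each response. The function names, the cost limit, and the step size are hypothetical and chosen only for illustration.

```python
import math


def combined_reward(r, c, lam):
    """Combined reward r_lambda = r - lambda * c (reward minus weighted cost)."""
    return r - lam * c


def bt_preference(r_w, c_w, r_l, c_l, lam):
    """Bradley-Terry probability that response w is preferred over response l
    under the combined reward r_lambda: sigmoid(r_lambda(w) - r_lambda(l))."""
    diff = combined_reward(r_w, c_w, lam) - combined_reward(r_l, c_l, lam)
    return 1.0 / (1.0 + math.exp(-diff))


def dual_step(lam, expected_cost, cost_limit, step_size):
    """One projected dual update (assumed form): raise lambda when the policy's
    expected cost exceeds the limit, lower it otherwise, and clip at zero."""
    return max(0.0, lam + step_size * (expected_cost - cost_limit))


# Response A is less helpful but safe; response B is more helpful but costly.
p_a_no_penalty = bt_preference(1.0, 0.0, 2.0, 3.0, lam=0.0)
p_a_penalized = bt_preference(1.0, 0.0, 2.0, 3.0, lam=1.0)
```

With `lam=0.0` the more helpful response B is favored; with `lam=1.0` the cost penalty flips the preference toward the safe response A. This is the trade-off the dual variable is meant to control.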
Abstract
The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computationally expensive and often unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs that is both efficient and lightweight. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint compared to a recently proposed safe RLHF approach. Warning: This paper contains example data that may be offensive or harmful.
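The constrained problem described in the abstract can be written out as a sketch. The notation is assumed rather than taken from this page: $\pi_{\mathrm{ref}}$ is a reference policy, $\beta$ is a KL-regularization weight as in standard DPO, and $C_{\text{limit}}$ is the safety threshold.

```latex
\begin{aligned}
&\max_{\pi}\;\; \mathbb{E}_{x,\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]
  \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
  \quad\text{s.t.}\quad
  \mathbb{E}_{x,\,y\sim\pi(\cdot\mid x)}\big[c(x,y)\big] \le C_{\text{limit}}, \\[4pt]
&\min_{\lambda\ge 0}\;\max_{\pi}\;\;
  \mathbb{E}_{x,\,y\sim\pi(\cdot\mid x)}\big[r(x,y)-\lambda\,c(x,y)\big]
  \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
  \;+\; \lambda\,C_{\text{limit}}.
\end{aligned}
```

Under these assumptions, the inner maximization for a fixed $\lambda$ is an unconstrained preference-optimization problem with reward $r - \lambda c$, which is the kind of problem DPO can solve without RL. The outer minimization over $\lambda$ is where dual gradient descent comes in.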
