Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Xiyue Peng, Hengquan Guo, Jiawei Zhang, Dongqing Zou, Ziyu Shao, Honghao Wei, Xin Liu

TL;DR

The paper tackles safety alignment in RLHF, identifying safety compensation as a risk when safety is enforced via an expected constraint. It introduces Rectified Policy Optimization (RePO), which imposes a per-prompt critical safety constraint $C(x,y) \le 0$ and uses a rectified policy gradient with a rectified penalty $\{C(x,y)\}^+$ to guide updates, forming a min–max objective $L(\pi_\theta, \lambda)$ that preserves helpfulness when safety is guaranteed. Core contributions include a formal rectified reformulation, token-level decomposition of rewards and costs, PPO-style clipped objectives with safe/unsafe batching, and an empirical demonstration that RePO delivers stronger safety alignment than methods optimizing expected safety, on Alpaca-7B and Llama3.2-3B. The results suggest that enforcing per-prompt safety can yield safer, more reliable LLMs without sacrificing performance, with potential broad impact for safer deployment of RLHF systems.

Abstract

Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov Decision Process (CMDP) framework. This paper identifies a potential issue when using the widely adopted expected safety constraints for LLM safety alignment, termed "safety compensation", where the constraints are satisfied in expectation, but individual prompts may trade off safety, resulting in some responses being overly restrictive while others remain unsafe. To address this issue, we propose Rectified Policy Optimization (RePO), which replaces the expected safety constraint with critical safety constraints imposed on every prompt. At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts. Our experiments demonstrate that RePO outperforms strong baseline methods and significantly enhances LLM safety alignment.
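
The safety compensation effect described above can be illustrated with a small numeric sketch. The cost values and the helper name `rectify` are hypothetical and chosen only for illustration; the sketch assumes, as in the abstract, that a response is safe when its cost is at most 0.

```python
def rectify(cost: float) -> float:
    """Rectification operator {c}^+ = max(c, 0)."""
    return max(cost, 0.0)

# Hypothetical per-prompt costs: two overly cautious responses, two unsafe ones.
costs = [-3.0, -2.0, 1.5, 2.5]

# Expected safety constraint: only the average cost has to be <= 0.
expected_cost = sum(costs) / len(costs)

# Critical safety constraint: every cost has to be <= 0, so only the
# violations contribute to the rectified penalty.
rectified_cost = sum(rectify(c) for c in costs) / len(costs)
unsafe = [c for c in costs if c > 0.0]

print(f"expected cost  = {expected_cost}")   # negative: looks safe on average
print(f"rectified cost = {rectified_cost}")  # positive: violations are exposed
print(f"unsafe samples = {len(unsafe)} of {len(costs)}")
```

In this toy batch the very negative costs offset the positive ones, so the expected constraint holds even though half of the responses are unsafe. The rectified penalty ignores the surplus of the safe responses and therefore stays positive until every violation is removed.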

Paper Structure

This paper contains 23 sections, 1 theorem, 22 equations, 4 figures, 8 tables.

Key Result

Theorem 1

The critical constrained MDP problem, i.e., the objective together with the critical safety constraint imposed on every prompt, is equivalent to a min-max rectified formulation, where $\{\cdot\}^+ = \max\{\cdot, 0\}$ represents the rectification operator.
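
The displayed equation of the theorem is not preserved in this summary. As a hedged sketch only, one form consistent with the TL;DR (a min-max objective $L(\pi_\theta, \lambda)$ with the rectified penalty $\{C(x,y)\}^+$) is given below. The reward symbol $R$, the prompt distribution $\mathcal{D}$, the constraint $\lambda \ge 0$, and the ordering of the min and max are assumptions, not taken from the source.

```latex
% Sketch under stated assumptions; not the paper's verbatim statement.
\min_{\lambda \ge 0} \; \max_{\pi_\theta} \; L(\pi_\theta, \lambda),
\qquad
L(\pi_\theta, \lambda)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \bigl[\, R(x, y) - \lambda \, \{C(x, y)\}^+ \,\bigr]
```

Under this reading, if $C(x,y) \le 0$ holds for every sampled pair, the penalty vanishes and $L$ reduces to the expected reward, which matches the claim that helpfulness is preserved once safety is guaranteed. Any violation is penalized in proportion to $\lambda$.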

Figures (4)

  • Figure 1: Pitfalls of Expected Safety Constraints and Mitigation via Critical Safety Constraints. The left plot illustrates that an LM that is expected safe is not necessarily critical safe, i.e., $\pi_\theta\in\Pi_\text{expected}\setminus\Pi_\text{critical}$, so the formulation of expected safety constraints is likely to end up with the pitfall of safety compensation. The right plots compare the average costs and the number of unsafe samples during fine-tuning for initial models inside or outside $\Pi_{\text{expected}}$. The plots show that the formulation of strict safety constraints can effectively address this pitfall and enhance LLM safety significantly.
  • Figure 2: The comparison between RePO and baselines by GPT-4.
  • Figure 3: Fine-tuning logs of Alpaca-7B with SafeRLHF and RePO on initial training datasets that differ in average cost. Training was run independently five times with different seeds, and the results show the mean and standard deviation over the five runs. The first row is the cost score distribution of the prompt-response pairs generated by Alpaca-7B. We selected 3 representative datasets, over which Alpaca-7B is expected unsafe, nearly expected safe, and expected safe, respectively. S.R. denotes the safety rate of the pairs over each training dataset. The second row is the average cost curve during fine-tuning, and the dashed line is the constraint cost threshold. The current LM is expected safe over the training batch if the average cost is under the line. The third row is the number of unsafe samples in the current training batch (128 samples per batch in total). A sample is unsafe if and only if the cost of the prompt-response pair generated by the current LM is greater than 0.
  • Figure 4: The scatter plot illustrates the cost-reward distribution of the initial models and of the models resulting from different algorithms. The reward indicates helpfulness, and the cost indicates harmlessness. A response is safe if and only if its cost is no greater than 0.
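
The quantities plotted in Figure 3 (average cost against a threshold, number of unsafe samples, safety rate) can be computed from per-sample costs roughly as follows. This is a minimal sketch: the function name `batch_safety_stats`, the default threshold of 0, and the reading of the safety rate as the fraction of samples with cost at most 0 are assumptions, not details confirmed by the captions.

```python
from typing import Dict, List, Union

def batch_safety_stats(
    costs: List[float], threshold: float = 0.0
) -> Dict[str, Union[float, int, bool]]:
    """Summarize one batch of per-sample costs (hypothetical helper)."""
    n = len(costs)
    average_cost = sum(costs) / n
    # A sample counts as unsafe iff its cost is strictly greater than 0.
    num_unsafe = sum(1 for c in costs if c > 0.0)
    return {
        "average_cost": average_cost,
        # Expected safe: the batch average does not exceed the threshold.
        "expected_safe": average_cost <= threshold,
        "num_unsafe": num_unsafe,
        # Assumed definition: fraction of samples that are individually safe.
        "safety_rate": (n - num_unsafe) / n,
    }

# Hypothetical batch of four costs (the batches in Figure 3 hold 128 samples).
stats = batch_safety_stats([-1.0, -1.0, 0.5, -0.5])
print(stats)
```

A batch can report `expected_safe` as true while `num_unsafe` is still nonzero, which is the gap between the second and third rows of Figure 3.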

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Remark 1