Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models

Yingshui Tan, Yilei Jiang, Yanshi Li, Jiaheng Liu, Xingyuan Bu, Wenbo Su, Xiangyu Yue, Xiaoyong Zhu, Bo Zheng

TL;DR

This work tackles the persistent tension between safety and helpfulness in RLHF fine-tuning of large language models, showing that naïvely scaling safety data can push models into an over-safe state that hurts usefulness. It proposes Equilibrate RLHF, combining Fine-grained Data-centric (FDC) safety data curation with Adaptive Message-wise Alignment (AMA) to selectively emphasize safety-critical segments. Empirical results demonstrate that modest, well-structured safety data paired with AMA outperform data-hungry baselines while preserving broad helpfulness, and that retrieval-augmented enhancements plus self-reflection can push explicit harmful data handling beyond traditional limits. The framework lays groundwork for safer and more helpful LLMs and points to future extensions into multimodal models and broader safety evaluation.

Abstract

Fine-tuning large language models (LLMs) based on human preferences, commonly achieved through reinforcement learning from human feedback (RLHF), has been effective in improving their performance. However, maintaining LLM safety throughout the fine-tuning process remains a significant challenge, as resolving conflicts between safety and helpfulness can be non-trivial. Typically, the safety alignment of LLMs is trained on data with safety-related categories. However, our experiments find that naively increasing the scale of safety training data usually leads LLMs to an "overly safe" state rather than a "truly safe" state: the extensive safety-aligned data boosts the refusal rate without the model genuinely understanding the requirements for safe responses. Such an approach can inadvertently diminish the models' helpfulness. To understand this phenomenon, we first investigate the role of safety data by categorizing it into three different groups, and observe that each group behaves differently as the training data scales up. To improve the balance between safety and helpfulness, we propose an Equilibrate RLHF framework that includes a Fine-grained Data-centric (FDC) approach, which achieves better safety alignment even with less training data, and an Adaptive Message-wise Alignment (AMA) approach, which selectively highlights the key segments through a gradient masking strategy. Extensive experimental results demonstrate that our approach significantly enhances the safety alignment of LLMs while balancing safety and helpfulness.

Paper Structure

This paper contains 26 sections, 9 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Examples of "truly safe" and "over safe", where Acetaminophen (Paracetamol) is a common over-the-counter medication used to relieve pain and reduce fever, while Acetomorphine (Diacetylmorphine) is a semi-synthetic opioid, also known as Heroin, which is a prohibited narcotic.
  • Figure 2: Examples of Two Causes Leading to Unsafe Responses from LLMs.
  • Figure 3: System Flow Diagram of our proposed Equilibrate RLHF Framework
  • Figure 4: Experimental results across different amounts of safety-related training data, mixed with about 260,000 general-ability training samples. Safety scores (harmless response ratio) on different harmful prompts (EHD, IHD, MHD) are reported. In addition, the safety score on real-world harmful data is also reported, named "natural". This experiment is based on the Qwen2-7B-instruct model. The helpfulness score is an average of the objective scores on 11 different open-source datasets.
  • Figure 5: Experimental results across different safety data distributions. In each picture, the numbers of IHD and MHD samples are fixed while the number of EHD samples gradually increases.
  • ...and 15 more figures