Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Wenxuan Zhang, Philip H. S. Torr, Mohamed Elhoseiny, Adel Bibi

TL;DR

BFPO tackles the safety-versus-helpfulness tension in LLM alignment by re-parameterizing a joint RLHF objective as a single supervised objective, using an empirical labeling function that encodes a global ranking of responses. The paper proves that this objective is theoretically equivalent to multi-objective RLHF with a bilinear reward and provides an algorithm for optimizing the BFPO loss, achieving strong harmlessness while preserving helpfulness on open models. Empirically, BFPO attains high harmlessness scores and substantial gains in safe generative behavior using only publicly available data, reducing reliance on costly red-teaming. The approach offers a data-efficient, generalizable framework for safety-aligned LLMs and could extend to other conflicting objectives in model alignment.

Abstract

Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts between safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function captures the global preference ranking to balance safety and helpfulness. To evaluate BFPO, we develop a benchmark that includes comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO achieves the same level of safety as methods that rely heavily on human labor, using less than 10% of the computational resources and of the human prompting and annotation effort. The training recipes can be found here: https://github.com/wx-zhang/bfpo.
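
To make the idea of a labeling function inside a supervised objective concrete, here is a minimal sketch. It is not the paper's loss. It assumes an IPO-style squared-error form in which the policy-versus-reference log-ratio margin between the more helpful response (y_hw) and the less helpful one (y_hl) is regressed onto a target that depends on the two responses' safety labels. The names `TOY_LABELS`, `toy_label`, `supervised_pair_loss`, the temperature `tau`, and every numeric target are hypothetical placeholders.

```python
# Hypothetical targets indexed by (is_safe(y_hw), is_safe(y_hl)).
# Placeholder numbers for illustration only; they are NOT the paper's label values.
TOY_LABELS = {
    (True, False): 1.0,    # helpful and safe vs. unsafe: strongly prefer y_hw
    (True, True): 0.5,     # both safe: mildly prefer the more helpful y_hw
    (False, False): 0.5,   # both unsafe: placeholder mild preference
    (False, True): -1.0,   # helpful but unsafe vs. safe: prefer y_hl instead
}


def toy_label(safe_hw: bool, safe_hl: bool) -> float:
    """Look up the hypothetical target for one response pair."""
    return TOY_LABELS[(safe_hw, safe_hl)]


def supervised_pair_loss(margin: float, safe_hw: bool, safe_hl: bool,
                         tau: float = 1.0) -> float:
    """Squared error between the log-ratio margin and the labeled target.

    `margin` is assumed to be
    (log pi(y_hw|x) - log pi_ref(y_hw|x)) - (log pi(y_hl|x) - log pi_ref(y_hl|x)).
    Dividing the label by `tau` is an assumed scaling, not taken from the paper.
    """
    target = toy_label(safe_hw, safe_hl) / tau
    return (margin - target) ** 2


if __name__ == "__main__":
    # Same margin, different safety labels: the sign of the target decides
    # which response the supervised objective pushes the policy toward.
    for pair in TOY_LABELS:
        print(pair, supervised_pair_loss(1.0, *pair))
```

The point of the sketch is only that a single regression-style loss can encode a ranking over both factors: the safety labels change the target, so no separate reward model or RL loop is needed in this toy setup.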

Paper Structure (30 sections, 7 theorems, 59 equations, 5 figures, 10 tables, 1 algorithm)

Key Result

Theorem 3.1

The optimization problem in eq:hs_multiobjectiverl has a solution $\pi^*$, and $\pi^*(y)$ is the unique solution to the following optimization problem

Figures (5)

  • Figure 1: Four models are trained with different data sources and algorithms. Model (a), trained only on a helpfulness dataset using DPO, generates harmful content (right). Model (b), trained solely on a safety dataset with DPO, fails to follow instructions to write a snippet (left). Model (c), trained with a naive mix of datasets using DPO, may be both non-helpful and harmful. Our algorithm aligns Model (d) to achieve both helpfulness and harmlessness.
  • Figure 2: Global preference ranking of different responses.
  • Figure 3: Pairwise preference between responses $y^{hw}, y^{hl}$ with different safety labels, and the corresponding label values.
  • Figure 4: Action probabilities over steps during policy optimization with DPO, IPO, and our BFPO on a synthetic dataset. Only BFPO recovers the desired ranking.
  • Figure 5: Helpfulness and harmlessness of open-source models. The marker size represents the approximate training data size and annotation cost.
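
For readers unfamiliar with the two baselines compared in Figure 4, the sketch below gives the standard per-pair forms of the DPO and IPO losses as they are commonly stated in the literature. It is a reference sketch, not code from the paper's repository: the function names and the default values of `beta` and `tau` are illustrative assumptions, and the losses are written for scalar log-probabilities rather than batched tensors.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Gap between the policy/reference log-ratios of the preferred (w)
    and rejected (l) responses."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(h, beta=0.1):
    """DPO per-pair loss: -log(sigmoid(beta * h)).

    Computed as softplus(-beta * h) in a numerically stable way.
    """
    x = beta * h
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def ipo_loss(h, tau=0.1):
    """IPO per-pair loss: squared distance between the margin and the
    fixed target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    # Illustrative log-probabilities (assumed values, not from the paper).
    h = log_ratio_margin(-1.0, -2.0, -1.5, -1.5)
    print("margin:", h)
    print("DPO loss:", dpo_loss(h))
    print("IPO loss:", ipo_loss(h))
```

Both losses see only one binary preference per pair: DPO keeps pushing the margin upward, while IPO pulls it toward a single fixed target. Neither has a built-in way to let a second factor, such as a safety label, change the target.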

Theorems & Definitions (10)

  • Theorem 3.1: IPO
  • Theorem 3.2
  • Proposition 3.3
  • Theorem B.1
  • Lemma B.2: DPO, IPO
  • Lemma B.3: Theorem 2 in IPO
  • proof
  • proof
  • Theorem B.4
  • proof