Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip H. S. Torr, Mohamed Elhoseiny, Adel Bibi
TL;DR
BFPO tackles the safety-versus-helpfulness tension in LLM alignment by re-parameterizing a joint RLHF objective as a single supervised objective, using an empirical labeling function that encodes a global ranking of responses. It proves theoretical equivalence to multi-objective RLHF with a bilinear reward and provides an algorithm to optimize the BFPO loss, achieving strong harmlessness while preserving helpfulness on open models. Empirically, BFPO attains high harmlessness scores and substantial gains in safe generative behavior using only publicly available data, reducing reliance on costly red-teaming. The approach offers a data-efficient, generalizable framework for safety-aligned LLMs and could extend to other conflicting objectives in model alignment.
Abstract
Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts between safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective over both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function captures the global preference ranking so as to balance safety and helpfulness. To evaluate BFPO, we develop a benchmark that includes comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO achieves the same level of safety as methods that rely heavily on human labor while using less than 10% of the computational resources and of the human prompting and annotation effort. The training recipes can be found here: https://github.com/wx-zhang/bfpo.
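To make the idea of a labeling function inside a supervised preference objective more concrete, the following is a minimal illustrative sketch and not the BFPO loss itself. The abstract does not give the labeling function or the loss, so everything here is an assumption made for illustration: the names `toy_label` and `supervised_pref_loss`, the "safety first, helpfulness as tie-breaker" rule, the label values, and the squared-error form of the loss.

```python
# Illustrative sketch only: the labeling rule, label values, and loss form
# below are assumptions, not the definitions used in the BFPO paper.

def toy_label(a_more_helpful: bool, a_safe: bool, b_safe: bool) -> float:
    """Hypothetical labeling function for a response pair (a, b).

    Returns a signed target margin: positive means a should rank above b.
    Assumed rule: safety decides first; helpfulness breaks ties.
    """
    if a_safe != b_safe:
        return 1.0 if a_safe else -1.0
    return 0.5 if a_more_helpful else -0.5


def supervised_pref_loss(logratio_a: float, logratio_b: float,
                         label: float, tau: float = 1.0) -> float:
    """Squared-error loss that regresses the policy margin onto the label.

    logratio_x stands for log pi(x | prompt) - log pi_ref(x | prompt).
    tau is an assumed temperature-like scaling of the target margin.
    """
    margin = logratio_a - logratio_b
    return (margin - label / tau) ** 2


# Response a is more helpful but unsafe; response b is safe.
label = toy_label(a_more_helpful=True, a_safe=False, b_safe=True)
loss = supervised_pref_loss(logratio_a=0.0, logratio_b=1.0, label=label)
print(label, loss)
```

In this toy setup the label is negative because the unsafe response is ranked below the safe one even though it is more helpful, and the loss is zero once the policy margin matches the target margin. A single scalar label per pair is what lets both factors be folded into one supervised objective without training separate reward models.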
