Course-Correction: Safety Alignment Using Synthetic Preferences

Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu

TL;DR

This paper systematically evaluates and enhances LLMs’ capability to perform course-correction, and creates C^2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning.

Abstract

The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform course-correction, i.e., the ability of a model to steer away from generating harmful content autonomously. To start with, we introduce the C$^2$-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C$^2$-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on 2 LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs' safety, particularly in resisting jailbreak attacks.

Paper Structure (36 sections, 3 equations, 15 figures, 16 tables, 1 algorithm)

Figures (15)

  • Figure 1: An illustrative example of course-correction. (a) The model returns an unsafe response to the harmful request. (b) The model initially provides an unsafe response but subsequently performs a timely correction, a process known as course-correction.
  • Figure 2: An illustration of evaluating course-correction ability. The tested model receives as input the concatenation of the harmful request HR and the initial harmful response IHR. <user_start>, <user_end> and <ai_start>, <ai_end> wrap the content of the user prompt and model response, respectively.
  • Figure 3: $\texttt{Corr}@k$ for tested LLMs on C$^2$-Eval.
  • Figure 4: Illustration of generating preference data in C$^2$-Syn. We synthesize self-contained preferences based on the harmful request HR and the full harmful response FHR using two value principles. $\mathcal{M}_{\text{aligned}}$ denotes a well-aligned LLM; we select Llama2-Chat 7B for this purpose. See Appendix Table \ref{tab:annotation-sample} for a detailed example.
  • Figure 5: Summed probability of safety tokens at the first decoding position after an IHR of length $k$.
  • ...and 10 more figures
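
The evaluation input described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_eval_input` is hypothetical, the IHR is assumed to be the first `k` whitespace-separated tokens of a longer response, and the AI turn is assumed to be left open (no `<ai_end>`) so that the tested model continues from the IHR. The delimiter strings are the placeholders used in the caption; a real chat template would substitute its own markers.

```python
# Placeholder delimiters taken from the Figure 2 caption; real chat
# templates use model-specific markers instead.
USER_START, USER_END = "<user_start>", "<user_end>"
AI_START = "<ai_start>"


def build_eval_input(request: str, full_response: str, k: int) -> str:
    """Concatenate the request (HR) with a length-k prefix of the response (IHR).

    Assumption: "length k" is measured in whitespace-separated tokens here;
    a real setup would more likely count tokenizer tokens.
    """
    ihr = " ".join(full_response.split()[:k])
    # Assumption: the AI turn is left open so the model keeps generating
    # after the IHR, which is where course-correction would be observed.
    return f"{USER_START}{request}{USER_END}{AI_START}{ihr}"


if __name__ == "__main__":
    # Neutral placeholder strings stand in for HR and the response.
    print(build_eval_input("REQUEST", "tok1 tok2 tok3 tok4", k=2))
```

Varying `k` in such a sketch changes how much of the response is pre-filled before the model takes over, which is the quantity the captions of Figures 3 and 5 refer to.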