Alignment with Preference Optimization Is All You Need for LLM Safety

Reda Alami; Ali Khalifa Almansoori; Ahmed Alzubaidi; Mohamed El Amine Seddik; Mugariya Farooq; Hakim Hacid

TL;DR

This study identifies noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and performance and shows that alignment techniques can be sufficient for building safe and robust models.

Abstract

We demonstrate that preference optimization methods can effectively enhance LLM safety. Applying various alignment techniques to the Falcon 11B model using safety datasets, we achieve a significant boost in global safety score (from $57.64\%$ to $99.90\%$) as measured by LlamaGuard 3 8B, competing with state-of-the-art models. On toxicity benchmarks, average scores in adversarial settings dropped from over $0.6$ to less than $0.07$. However, this safety improvement comes at the cost of reduced general capabilities, particularly in math, suggesting a trade-off. We identify noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and performance. Our study ultimately shows that alignment techniques can be sufficient for building safe and robust models.

Paper Structure (16 sections, 6 equations, 5 figures, 6 tables)

Introduction
Related Work
Safety Evaluation for LLMs
Safety Enhancement for LLMs
The Safety Problem for LLMs
Safety Objectives:
Safety Alignment
Methodology
Dataset with the pairwise comparison for safe alignment
Safety Alignment Methods
Evaluations
ALERT
Adversarial ALERT
Toxicity
Results
...and 1 more section

Figures (5)

Figure 1: Comparison of the global safety scores of $11$ LLMs. The scores are derived from averaging the results of the safety ALERT and safety Adversarial ALERT benchmarks to assess each model's overall performance across the safety evaluations. Notice the significant performance boost from $57.64\%$ to $99.9\%$ for the Falcon 11B model.
Figure 2: (a) $\mathbf{E}[\max_{\text{tox}}]$ + Benign
Figure 3: (b) $\mathbf{E}[\max_{\text{tox}}]$ + Adversarial
Figure 4: (c) $\text{avg}_{\text{tox}}$ + Benign
Figure 5: (d) $\text{avg}_{\text{tox}}$ + Adversarial
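
The Figure 1 caption describes the global safety score as an average of the ALERT and Adversarial ALERT results. A minimal sketch of that arithmetic follows, assuming an unweighted mean of the two benchmark scores and assuming each benchmark score is the fraction of responses a judge model labels safe. Neither assumption is confirmed by the text on this page, and the function names and sample values are hypothetical.

```python
from typing import Sequence


def safety_score(is_safe: Sequence[bool]) -> float:
    """Fraction of judged responses labeled safe.

    Assumption: this is how a per-benchmark safety score is computed;
    the page lists "Safety Score S" but does not show its definition.
    """
    if len(is_safe) == 0:
        raise ValueError("need at least one judged response")
    return sum(is_safe) / len(is_safe)


def global_safety_score(alert: float, adversarial_alert: float) -> float:
    """Unweighted mean of the two benchmark scores.

    Assumption: "averaging" in the Figure 1 caption means a simple mean.
    """
    return (alert + adversarial_alert) / 2.0


# Hypothetical judge labels, not results from the paper.
alert = safety_score([True, True, False, True])        # 3 of 4 safe
adversarial = safety_score([True, False, True, True])  # 3 of 4 safe
print(f"global safety score: {global_safety_score(alert, adversarial):.2%}")
```

If the two benchmarks differ in size, a prompt-weighted mean would give a different number, so the weighting is worth checking against the paper's own definition before reusing this.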

Theorems & Definitions (5)

Definition 1: Harmful Category
Definition 2: Adversarial Attacks
Definition 3: Safe/Unsafe output
Definition 4: Safety Score $S$
Definition 5: Attack Success Rate Score (ASR)
