LLM Safety Alignment is Divergence Estimation in Disguise

Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing

TL;DR

This work reframes LLM safety alignment as divergence estimation between aligned and unaligned response distributions, unifying RLHF, DPO, KTO, and BCO under a single probabilistic lens. It introduces KLDO, a KL-divergence-based alignment objective, and a broader FDO family, with theoretical guarantees stated through the notion of alignment consistency and linked to the separation of safe and harmful prompts in latent space. The authors show that compliance-refusal (CR) data yield stronger separation and robustness than preference (Pref) data, and they validate these predictions with experiments across multiple model families, demonstrating that greater latent separation correlates with improved safety. The framework offers a principled path to designing new alignment objectives that improve safety without sacrificing utility.

Abstract

We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
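To make the idea of "estimating a divergence from samples" concrete, here is a minimal sketch of the standard Donsker-Varadhan lower bound on the KL divergence. It is a generic illustration, not the authors' KLDO objective: the toy one-dimensional Gaussians standing in for the aligned and unaligned distributions, the function name `dv_lower_bound`, and the two hand-written critics are all assumptions made for this example.

```python
import numpy as np


def dv_lower_bound(critic, samples_p, samples_q):
    """Donsker-Varadhan bound: KL(P || Q) >= E_P[T] - log E_Q[exp(T)]."""
    t_p = critic(samples_p)
    t_q = critic(samples_q)
    # Numerically stable log-mean-exp of the critic values under Q.
    shift = np.max(t_q)
    log_mean_exp_q = shift + np.log(np.mean(np.exp(t_q - shift)))
    return float(np.mean(t_p) - log_mean_exp_q)


# Toy stand-ins (assumption): P = N(mu, 1) plays the "aligned" distribution,
# Q = N(0, 1) the "unaligned" one. For these, KL(P || Q) = mu^2 / 2.
rng = np.random.default_rng(0)
mu = 1.0
samples_p = rng.normal(mu, 1.0, size=200_000)
samples_q = rng.normal(0.0, 1.0, size=200_000)
kl_true = mu**2 / 2


def optimal_critic(x):
    # The bound is tight when the critic equals the log density ratio log p(x)/q(x).
    return mu * x - mu**2 / 2


def weak_critic(x):
    # A deliberately suboptimal critic: still a valid, but looser, lower bound.
    return 0.5 * x


est_opt = dv_lower_bound(optimal_critic, samples_p, samples_q)
est_weak = dv_lower_bound(weak_critic, samples_p, samples_q)
print(f"true KL = {kl_true:.3f}, optimal critic = {est_opt:.3f}, weak critic = {est_weak:.3f}")
```

In this picture, training a model amounts to searching over critics: the better the critic separates the two sample sets, the closer the bound gets to the true divergence. The critic here is fixed by hand only to keep the sketch short.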

Paper Structure

This paper contains 47 sections, 4 theorems, 59 equations, 13 figures, 8 tables.

Key Result

Theorem 4.1

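The alignment losses introduced in the section on RLHF-based methods satisfy the relation stated in the paper (the displayed equation is not reproduced here), where $\theta^*=\arg\inf_{\theta} \mathcal{L}(\theta)$ for the respective alignment loss $\mathcal{L}$.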

Figures (13)

  • Figure 1: Latent space separation by prompt safety in an aligned model (right: Qwen2.5-Instruct) compared to its unaligned counterpart (left: Qwen2.5-base).
  • Figure 2: Unified divergence-estimation view of alignment. Alignment methods can be interpreted as estimating divergences between aligned ($\mathcal{D}^+$) and unaligned ($\mathcal{D}^-$) response distributions. Different choices of divergence recover prior methods (e.g., DPO, KTO, BCO) as special cases, while the same principle enables new objectives (KLDO, FDO). This unified perspective demystifies the learning mechanism of alignment (contrasting safe/preferred responses with unsafe/less-preferred ones), the separation phenomenon, and related effects.
  • Figure 3: Illustrative example for data generation model.
  • Figure 4: Ability of various divergence metrics to distinguish between separate clusters.
  • Figure 5: Latent Space Visualization after various alignment methods for Qwen-2.5-1.5B.
  • ...and 8 more figures
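
Several of the figures above concern how well safe and harmful prompts separate in latent space. The following is a minimal sketch of one way such separation could be scored: the Bhattacharyya distance between Gaussian fits to two clusters of embeddings. This summary does not say which distance the paper's metric uses, so the choice of distance, the function name `bhattacharyya_distance`, and the synthetic two-dimensional "embeddings" are all assumptions for illustration.

```python
import numpy as np


def bhattacharyya_distance(x_a, x_b):
    """Bhattacharyya distance between Gaussian fits to two (n, d) sample arrays, d >= 2."""
    mu_a, mu_b = x_a.mean(axis=0), x_b.mean(axis=0)
    cov_a = np.cov(x_a, rowvar=False)
    cov_b = np.cov(x_b, rowvar=False)
    cov_avg = 0.5 * (cov_a + cov_b)

    # Mean term: squared Mahalanobis-style gap between the cluster centres.
    diff = mu_a - mu_b
    mean_term = 0.125 * diff @ np.linalg.solve(cov_avg, diff)

    # Covariance term: penalises mismatch in cluster shape (zero if cov_a == cov_b).
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    _, logdet_a = np.linalg.slogdet(cov_a)
    _, logdet_b = np.linalg.slogdet(cov_b)
    cov_term = 0.5 * (logdet_avg - 0.5 * (logdet_a + logdet_b))

    return float(mean_term + cov_term)


# Synthetic stand-ins for prompt embeddings (assumption): unit-variance clusters.
rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, size=(50_000, 2))
harmful = rng.normal(0.0, 1.0, size=(50_000, 2)) + np.array([3.0, 0.0])
safe_again = rng.normal(0.0, 1.0, size=(50_000, 2))

d_far = bhattacharyya_distance(safe, harmful)      # population value is 9/8
d_same = bhattacharyya_distance(safe, safe_again)  # population value is 0
print(f"separated clusters: {d_far:.3f}, overlapping clusters: {d_same:.4f}")
```

With real models, `safe` and `harmful` would be replaced by hidden-state vectors for the two prompt groups, typically after dimensionality reduction so that the covariance estimates stay well conditioned.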

Theorems & Definitions (14)

  • Theorem 4.1
  • Definition 4.2: Alignment Consistent
  • Theorem 4.3
  • Remark 4.4: Is DPO Alignment Consistent?
  • Theorem 4.5: Separation
  • Definition A.1: KTO reference constant $z_0$
  • Definition A.2: BCO reference constant $\delta$
  • Definition A.3: $f$-Divergence
  • Definition A.4: Convex Conjugate
  • Proof of the theorem labeled "thm: divergence convergence"
  • ...and 4 more
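
For readers unfamiliar with the objects named in Definitions A.3 and A.4, the standard textbook forms are sketched below. The paper's own notation and conditions may differ; the symbols $P$, $Q$, $f$, $T$, and $\mathcal{T}$ are chosen here for illustration.

```latex
% f-divergence: f is convex with f(1) = 0, and P is absolutely continuous w.r.t. Q
D_f(P \,\|\, Q) = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{dP}{dQ}(x) \right) \right]

% Convex conjugate of f
f^*(t) = \sup_{u \in \operatorname{dom} f} \bigl\{ u\,t - f(u) \bigr\}

% Variational lower bound over any class \mathcal{T} of critic functions T
D_f(P \,\|\, Q) \;\ge\; \sup_{T \in \mathcal{T}}
  \Bigl\{ \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}\bigl[f^*(T(x))\bigr] \Bigr\}

% Example: f(u) = u \log u gives the KL divergence, with f^*(t) = e^{t-1}
```

The variational bound is what turns a divergence into a trainable objective: maximizing the right-hand side over a parametric critic requires only samples from $P$ and $Q$, not their densities.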