LLM Safety Alignment is Divergence Estimation in Disguise
Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing
TL;DR
This work reframes LLM safety alignment as divergence estimation between aligned and unaligned response distributions, unifying RLHF, DPO, KTO, and BCO under a single probabilistic lens. It introduces KLDO, a KL-divergence-based alignment objective, and a broader FDO family of objectives, gives theoretical guarantees in terms of alignment consistency, and links these objectives to the separation of safe and harmful prompts in latent space. The authors show that compliance-refusal (CR) data yield stronger separation and robustness than preference (Pref) data, and they validate these predictions with experiments across multiple model families, finding that greater latent separation correlates with improved safety. The framework offers a principled path to designing new alignment objectives that improve safety without sacrificing utility.
Abstract
We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
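To make "divergence estimation" concrete, the sketch below illustrates one standard way to estimate a KL divergence from samples: the Donsker-Varadhan variational lower bound, KL(P || Q) >= E_P[T] - log E_Q[exp(T)], which holds for any critic function T and is tight when T equals the log density ratio up to a constant. This is a toy illustration under stated assumptions, not the paper's KLDO objective: the two Gaussians stand in for the aligned and unaligned distributions, and the function name `dv_kl_bound` and both critics are hypothetical choices made for this example.

```python
import numpy as np


def dv_kl_bound(critic, samples_p, samples_q):
    """Donsker-Varadhan lower bound on KL(P || Q) for a given critic T.

    Computes mean_P[T(x)] - log mean_Q[exp(T(x))] from samples.
    """
    t_p = critic(samples_p)
    t_q = critic(samples_q)
    # Stable log-mean-exp: subtract the max before exponentiating.
    m = np.max(t_q)
    log_mean_exp_q = m + np.log(np.mean(np.exp(t_q - m)))
    return float(np.mean(t_p) - log_mean_exp_q)


# Toy stand-ins (assumption): P = N(1, 1) plays the "aligned" distribution,
# Q = N(0, 1) the "unaligned" one. Their analytic KL(P || Q) is 0.5.
rng = np.random.default_rng(0)
n = 200_000
samples_p = rng.normal(1.0, 1.0, n)
samples_q = rng.normal(0.0, 1.0, n)

# Optimal critic: the log density ratio log p(x)/q(x) = x - 0.5.
tight = dv_kl_bound(lambda x: x - 0.5, samples_p, samples_q)

# A deliberately suboptimal critic gives a looser (smaller) bound.
loose = dv_kl_bound(lambda x: 0.5 * x, samples_p, samples_q)

print(f"optimal critic bound:    {tight:.3f}")
print(f"suboptimal critic bound: {loose:.3f}")
```

With the optimal critic the estimate should land near the analytic value of 0.5, while the suboptimal critic stays below it. In a learned setting the critic would be parameterized and trained to maximize the bound, which is the sense in which optimizing an objective can double as estimating a divergence.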
