Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

Yonghui Yang, Wenjian Tao, Jilong Liu, Xingyu Zhu, Junfeng Fang, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua

TL;DR

This work addresses robustness gaps in LLM safety alignment by shifting attention from data-centric robustness to optimization geometry. It introduces ShaPO, a selective geometry control framework that restricts adversarial parameter perturbations to a safety-critical subspace identified via probe-based signals, which stabilizes preference-based alignment under domain shift and noisy supervision. ShaPO has two instantiations: a token-level variant that works with token-likelihood surrogates and a reward-level variant that works with semantic reward signals. The method is reported to retain strong IID performance, improve OOD safety robustness, and remain compatible with data-centric robustness methods. The findings suggest that controlling optimization geometry yields gains that are complementary to data-centric approaches, and that combining the two brings further improvement.

Abstract

Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective.
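
To make the idea of a worst-case objective under perturbations restricted to a parameter subspace more concrete, the following is a minimal sketch on a toy problem. It is not the paper's algorithm: it assumes a sharpness-aware-minimization-style first-order approximation of the worst case, a quadratic loss standing in for the alignment loss, and a hand-written binary `mask` standing in for the probe-identified alignment-critical coordinates. The names `selective_perturbation`, `selective_step`, `rho`, and `lr` are all hypothetical.

```python
import math

def grad(theta, curvature):
    """Gradient of the toy loss 0.5 * sum(c_i * theta_i**2)."""
    return [c * t for c, t in zip(curvature, theta)]

def selective_perturbation(g, mask, rho):
    """First-order worst-case perturbation of radius rho,
    restricted to the coordinates where mask is 1."""
    masked = [gi * mi for gi, mi in zip(g, mask)]
    norm = math.sqrt(sum(v * v for v in masked))
    if norm == 0.0:
        return [0.0] * len(g)
    return [rho * v / norm for v in masked]

def selective_step(theta, curvature, mask, rho=0.1, lr=0.1):
    """One update: perturb only the masked coordinates, then descend
    along the gradient evaluated at the perturbed point."""
    g = grad(theta, curvature)
    eps = selective_perturbation(g, mask, rho)
    perturbed = [t + e for t, e in zip(theta, eps)]
    g_adv = grad(perturbed, curvature)
    return [t - lr * ga for t, ga in zip(theta, g_adv)]

# Toy setup (assumption): four parameters, the first two marked as critical.
theta = [1.0, -2.0, 0.5, 3.0]
curvature = [1.0, 4.0, 0.5, 2.0]
mask = [1, 1, 0, 0]

eps = selective_perturbation(grad(theta, curvature), mask, rho=0.1)
new_theta = selective_step(theta, curvature, mask)
print("perturbation:", eps)
print("updated parameters:", new_theta)
```

In this sketch the unmasked coordinates receive a plain gradient step, while the masked ones are updated with the gradient taken at the adversarially shifted point. Setting `mask` to all ones recovers a uniform perturbation over every parameter, which is the kind of uniform constraint the abstract argues against.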

Paper Structure

This paper contains 49 sections, 25 equations, 7 figures, 6 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Cumulative contribution to worst-case alignment loss under parameter perturbations. We compare the fraction of the total worst-case loss increase accounted for by perturbing probe-identified safety-critical neurons (Top-K) versus randomly selected neurons of the same size (Random-K).
  • Figure 2: Overview of ShaPO, a robust preference optimization framework with selective geometry control. ShaPO minimizes the worst-case alignment loss under adversarial perturbations restricted to the alignment-critical parameter subspace. The figure shows the reward-level instantiation of ShaPO.
  • Figure 3: Sensitivity to noisy preference data. We compare the win rate of all methods across preference-flipping ratios of 10%, 20%, and 40%. Left: results on the Pythia-2.8B backbone. Right: results on the LLaMA-3.2-3B backbone.
  • Figure 4: Composability of ShaPO with DPO and other data-centric alignment methods. We report the win rate against the chosen response. Left: comparisons on the Pythia-2.8B backbone. Right: comparisons on the LLaMA-3.2-3B backbone.
  • Figure 5: Reward-score distribution on the PKU-30K training set and the effect of different $\beta_r$ on score normalization. Left: the raw score difference $\Delta r = r(x, y^{w}) - r(x, y^{l})$ produced by the Beaver reward (negated cost) model on preference pairs. Right three: the corresponding sigmoid-transformed values $\sigma(\beta_r \Delta r)$ under $\beta_r \in \{0.1, 1, 10\}$.
  • ...and 2 more figures
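
The score normalization in the Figure 5 caption, $\sigma(\beta_r \Delta r)$ with $\Delta r = r(x, y^{w}) - r(x, y^{l})$, can be illustrated with a short sketch. The reward values below are made up for illustration and do not come from the Beaver model or the PKU-30K data; the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def normalized_margin(r_w, r_l, beta_r):
    """sigma(beta_r * (r_w - r_l)) for one preference pair."""
    return sigmoid(beta_r * (r_w - r_l))

# Hypothetical (chosen, rejected) reward scores for three preference pairs.
pairs = [(2.0, -1.0), (0.3, 0.1), (-0.5, 1.5)]

for beta_r in (0.1, 1.0, 10.0):
    values = [round(normalized_margin(r_w, r_l, beta_r), 3) for r_w, r_l in pairs]
    print(f"beta_r={beta_r}: {values}")
```

A small $\beta_r$ keeps every normalized value close to 0.5, whatever the raw margin, whereas a large $\beta_r$ pushes values toward 0 or 1, so the choice of $\beta_r$ controls how sharply the raw score difference is turned into a soft preference signal.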