The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova

TL;DR

Benign fine-tuning can degrade safety guardrails, a phenomenon that static, orthogonality-based theories do not explain. The authors develop the Alignment Instability Condition (AIC), a curvature-driven geometric framework showing that alignment concentrates in low-dimensional, high-curvature subspaces and that second-order gradient dynamics steer training into safety-sensitive regions, so that misalignment grows quartically in training time. They validate the theory with experiments showing low-rank Fisher structure and a geometric overlap score that predicts misalignment across tasks, including seemingly benign ones. The work argues for curvature-aware safe fine-tuning strategies and predictive diagnostics for the deployment of open-weight models, shifting safety analysis from reactive red-teaming to proactive geometric diagnostics.

Abstract

Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend against. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.

Paper Structure

This paper contains 38 sections, 23 theorems, 68 equations, 5 figures, and 2 tables.

Key Result

Theorem 1.1

Movement in the alignment-sensitive subspace $M_i\subseteq \mathbb{R}^n$ incurs quadratic utility loss, $\Omega(\lambda\|P_i(\Delta\theta)\|^2)$ where $\lambda$ is the minimum curvature in $M_i$ and $P_i$ projects onto this subspace.
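
The bound above can be made concrete with a small numerical sketch. Everything in the block is an illustrative assumption rather than the paper's setup: the 4-dimensional parameter space, the one-dimensional subspace standing in for $M_i$, the value of $\lambda$, and especially the toy displacement $\Delta\theta(t) = t\,v + \tfrac{1}{2}t^2 a$, which only mimics the abstract's description of a first-order direction $v$ that avoids the subspace and a second-order acceleration $a$ that does not. Under these assumptions the projected norm grows like $t^2$, so the quadratic bound grows like $t^4$.

```python
def project(vec, basis):
    """Orthogonally project vec onto the span of an orthonormal basis."""
    out = [0.0] * len(vec)
    for u in basis:
        coeff = sum(x * y for x, y in zip(vec, u))
        out = [o + coeff * y for o, y in zip(out, u)]
    return out


def utility_loss_lower_bound(delta_theta, basis, lam):
    """Evaluate lam * ||P(delta_theta)||^2, the quadratic term in Theorem 1.1."""
    projected = project(delta_theta, basis)
    return lam * sum(x * x for x in projected)


def toy_displacement(t, velocity, acceleration):
    """Hypothetical trajectory: delta_theta(t) = t*v + 0.5*t^2*a (an assumption)."""
    return [t * v + 0.5 * t * t * a for v, a in zip(velocity, acceleration)]


# Illustrative values only; none of these come from the paper.
BASIS = [[1.0, 0.0, 0.0, 0.0]]         # orthonormal basis of the toy subspace
LAM = 2.0                              # assumed minimum curvature in the subspace
VELOCITY = [0.0, 1.0, 0.0, 0.0]        # first-order direction, orthogonal to the subspace
ACCELERATION = [0.5, 0.0, 0.3, 0.0]    # second-order term with a component in the subspace


def bound_at(t):
    """Lower bound on utility loss at time t along the toy trajectory."""
    return utility_loss_lower_bound(
        toy_displacement(t, VELOCITY, ACCELERATION), BASIS, LAM
    )


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0, 4.0):
        print(f"t={t:4.1f}  bound={bound_at(t):.6f}")
```

In this toy setting the first-order step alone contributes nothing to the bound, while doubling $t$ multiplies the bound by 16, which is the quartic behaviour the abstract describes. The sketch does not reproduce the paper's proof or its constants.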

Figures (5)

  • Figure 1: Local alignment instability under fine-tuning dynamics. An illustrative loss landscape showing a fine-tuning trajectory (black arrow) evolving near a ridge separating basins of low alignment utility. Although the initial gradient direction is nearly tangent to the ridge, induced acceleration toward a high-curvature alignment-sensitive direction (red/blue arrow) leads to departure.
  • Figure 2: Top eigenvalues of FIM approximated over 100 random samples from BeaverTail's safe subset. Each subplot consists of multiple lines in different transparency levels for different layer indices, with each layer showing a similar low-rank structure.
  • Figure 3: Average Overlap Score per Transformer Block of 7 Fine-Tuning Datasets. Each value represents the average OS of all modules in the block.
  • Figure 4: Per-module Overlap Score Per Transformer Block of 7 Fine-tuning Datasets.
  • Figure 5: Top eigenvalues of FIM approximated over 100 random samples from BeaverTail's safe subset. Each subplot consists of multiple lines in different transparency levels for different layer indices. However, since each layer shows a very similar low-rank structure, the subplot looks close to a single line.

Theorems & Definitions (44)

  • Theorem 1.1: Informal Version of Theorem \ref{thm:util-bound}
  • Theorem 1.2: Informal Version of Theorem \ref{thm:projection}
  • Corollary 1.3: Informal Version of Corollary \ref{cor:quartic_onset}
  • Definition 3.1: Alignment Skill and Utility
  • Lemma 3.2: Alignment Loss as KL Divergence
  • Proposition 3.3: Local Geometric Form
  • Definition 3.4: Alignment Sensitivity Subspaces
  • Theorem 4.1: Benign Fine-Tuning under Flat Geometry
  • Definition 5.1: Alignment Instability Condition
  • Theorem 6.1
  • ...and 34 more