
Fail-Closed Alignment for Large Language Models

Zachary Coalson, Beth Sohler, Aiden Gabriel, Sanghyun Hong

TL;DR

The authors' mechanistic analyses confirm that models trained with the proposed fail-closed alignment framework encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety.

Abstract

We identify a structural weakness in current large language model (LLM) alignment: modern refusal mechanisms are fail-open. While existing approaches encode refusal behaviors across multiple latent features, suppressing a single dominant feature (via prompt-based jailbreaks) can cause alignment to collapse, leading to unsafe generation. Motivated by this, we propose fail-closed alignment as a design principle for robust LLM safety: refusal mechanisms should remain effective even under partial failures via redundant, independent causal pathways. We present a concrete instantiation of this principle: a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces. Across four jailbreak attacks, we achieve the strongest overall robustness while mitigating over-refusal and preserving generation quality, with small computational overhead. Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety.
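To make "identifying and ablating a refusal direction" concrete, the following is a minimal sketch on toy vectors, not the authors' implementation. It assumes the direction is estimated as a difference of mean activations between harmful and harmless prompts (one common choice) and that ablation means projecting that direction out of an activation. The helper names (`refusal_direction`, `ablate`) and the three-dimensional toy activations are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]

def mean(vectors):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def refusal_direction(harmful_acts, harmless_acts):
    """Assumed estimator: unit difference of mean activations."""
    diff = [h - b for h, b in zip(mean(harmful_acts), mean(harmless_acts))]
    return unit(diff)

def ablate(activation, direction):
    """Remove the component of `activation` along a unit `direction`."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations (hypothetical): the two groups differ only along axis 0.
harmful = [[2.0, 0.1, 1.0], [2.2, -0.1, 1.0]]
harmless = [[0.0, 0.1, 1.0], [0.2, -0.1, 1.0]]

r = refusal_direction(harmful, harmless)
before = dot(harmful[0], r)            # projection onto r before ablation
after = dot(ablate(harmful[0], r), r)  # projection onto r after ablation
print(before, after)
```

In this toy setting the refusal signal lives on a single direction, so ablating it removes the signal entirely, which is the fail-open behavior the abstract describes. A fail-closed model would, by design, retain a usable signal along other independent directions after such an ablation.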

Paper Structure (27 sections, 7 equations, 6 figures, 8 tables, 1 algorithm)


Figures (6)

  • Figure 1: Refusal in current LLMs depends on a single dominant feature that jailbreaks suppress. (Left) Attack success rate (ASR) on 200 harmful prompts before and after ablating the difference-in-means (DIM) refusal feature across four models. Ablation causes ASR to spike, showing that refusal depends on a single linear direction. (Right) Average cosine similarity between the refusal feature and the model's activations on the harmful prompts under different jailbreaks. All attacks substantially reduce activation of the refusal feature, indicating that they largely succeed by suppressing the same internal mechanism. Details on the prompts, prompt-based jailbreaks, and ASR calculation are provided in §\ref{subsec:setup}.
  • Figure 2: Effectiveness of our method w/ LoRA on Gemma2-9B.
  • Figure 3: Causal effects and activations of refusal features identified by our method on Llama3-8B. For each feature discovered during training ($r_1,\dots,r_{10}$): (Left) ASR on harmful prompts when ablating this feature together with all previously identified features, and CR on harmless prompts when adding the feature, shown both for our method and when MFA is replaced with single-feature ablation; (Right) average cosine similarity of the feature with the model's activations on harmful prompts under different prompt-based jailbreak attacks. Each feature is projected onto the orthogonal complement of the span of all previously identified features to isolate its individual contribution.
  • Figure 4: Sensitivity of our method's key training configurations on average ASR, CR, and Acc. for Gemma2-2B.
  • Figure 5: Effectiveness of our method with LoRA on Llama3-8B. The average ASR, CR, and Acc. across evaluation benchmarks.
  • ...and 1 more figure
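The Figure 3 caption mentions projecting each feature onto the orthogonal complement of the span of the previously identified features, then measuring its average cosine similarity with activations. A minimal sketch of that computation follows, assuming a Gram-Schmidt style projection. The function names and the toy vectors are hypothetical and do not come from the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean length of v."""
    return math.sqrt(dot(v, v))

def residual(feature, previous):
    """Project `feature` onto the orthogonal complement of span(previous)."""
    basis = []
    for p in previous:
        q = list(p)
        for b in basis:  # orthogonalize p against the basis built so far
            c = dot(q, b)
            q = [x - c * y for x, y in zip(q, b)]
        n = norm(q)
        if n > 1e-12:  # skip vectors already inside the span
            basis.append([x / n for x in q])
    out = list(feature)
    for b in basis:  # remove every component that lies in the span
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    return out

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom > 0.0 else 0.0

def mean_cosine(feature, activations):
    """Average cosine similarity of one feature with a set of activations."""
    return sum(cosine(feature, a) for a in activations) / len(activations)

# Toy example (hypothetical): r2 overlaps with r1 along axis 0.
r1 = [1.0, 0.0, 0.0]
r2 = [1.0, 1.0, 0.0]
r2_res = residual(r2, [r1])  # only the part of r2 not explained by r1
acts = [[0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]
score = mean_cosine(r2_res, acts)
print(r2_res, score)
```

Working with the residual rather than the raw feature means a later feature gets no credit for signal that an earlier feature already carries, which is what allows the per-feature contributions to be read independently.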