Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong

TL;DR

The paper reframes LLM safety alignment as an unobserved confounder problem and introduces CFA$^2$, a two-stage framework based on Pearl's Front-Door Criterion. By identifying an observable mediator $S$ with Sparse Autoencoders and applying weight orthogonalization to remove the defense subspace, CFA$^2$ achieves a deterministic, $O(1)$-complexity jailbreak that preserves task semantics. Empirically, CFA$^2$ delivers an $83.68\%$ average Attack Success Rate across multiple model families, with strong robustness on highly aligned models and substantially better efficiency and fluency than prior optimization-based methods. The work highlights a mechanistic, causal pathway for jailbreaking, offering insights for both offensive capability and defensive mitigation in LLM safety systems.

Abstract

Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA$^2$) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA$^2$ achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
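
For reference, the textbook form of Pearl's front-door adjustment can be written with the variable names used in the paper's SCM ($X$ for the query, $S$ for the mediator, $Y$ for the response). This is a sketch of the standard formula, assuming a discrete $S$ that satisfies the front-door criterion relative to $(X, Y)$. The dummy variable $x'$ is introduced here for illustration, and the paper's own formulation may differ in detail. The nested sums illustrate the kind of marginalization the abstract describes as computationally expensive.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{s} P(s \mid x) \sum_{x'} P(Y \mid s, x')\, P(x')
```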

Paper Structure

This paper contains 38 sections, 2 theorems, 19 equations, 5 figures, and 5 tables.

Key Result

Proposition 4.1

Let the data generating process be $\bm{x} = \bm{g}(\bm{z})$, where $\bm{g}: \mathcal{Z} \to \mathcal{X}$ is a smooth diffeomorphism and $\bm{z} \in \mathbb{R}^{d_z}$ are latent factors. Assume the prior $p(\bm{z}|y)$ conditioned on the response $y$ follows a conditionally factorial exponential family ...

Figures (5)

  • Figure 1: Jailbreaking SCM. $X$ represents an arbitrary query. $A$ is the representation embedding of $X$ in the LLM. $S$ denotes the main semantics inherent in $X$. $Y$ represents the harmful response. $U$ represents an unobservable internal safety mechanism of the LLM.
  • Figure 2: Overview of the Causal Front-Door Adjustment Attack (CFA$^2$) framework. The method operates in two phases. (1) Identification of the Front-Door Mediator: we analyze latent activations using paired contrastive samples, namely original harmful queries $\mathcal{D}_{\text{harm}}$ (triggering refusal) and their jailbroken variants $\mathcal{D}_{\text{attack}}$ (inducing compliance), which share identical task intent. By filtering style-variant features (representing the defense mechanism $U$), we isolate the defense direction vector $\mathbf{d}$. (2) Operationalizing Front-Door Adjustment: we structurally sever the causal link from the safety mechanism by projecting the original output weights $\mathbf{W}_{\text{out}}$ onto the orthogonal complement of $\mathbf{d}$. This structural intervention transforms the theoretical marginalization into an efficient $O(1)$ generation process using purified weights $\mathbf{W}_{\text{out}}^{\text{new}}$, allowing the model to bypass safety guardrails while preserving task intent $S$.
  • Figure 3: Validation of Disentanglement (Proposition 4.2). (a) Evidence of Style Variance: The activation distribution of $\bm{u}$ shows a distinct causal shift between the refusal state ($\mathcal{D}_{\text{harm}}$) and compliance state ($\mathcal{D}_{\text{attack}}$), confirming $\bm{u}(\bm{x}) \neq \bm{u}(\bm{x}^+)$. (b) Evidence of Content Invariance: While the defense mechanism (red star) acts as an outlier, the core task semantics (gray points, $\bm{s}$) remain aligned along the invariance line ($y=x$), confirming $\bm{s}(\bm{x}) \approx \bm{s}(\bm{x}^+)$.
  • Figure 4: Impact of Generation Length on ASR. We compare the stability of different intervention strategies as the target generation length increases (0-150 tokens). While activation clamping (red) shows a moderate initial ASR that degrades rapidly due to Defense Restoration, CFA$^2$ maintains a high success rate, demonstrating robustness against long-context safety recovery mechanisms.
  • Figure 5: Hyperparameter Sensitivity Analysis. We evaluate the ASR as a function of the number of Top-$k$ SAE features. The results show a wide stability plateau for $k \in [10, 50]$, confirming the robustness of CFA$^2$.
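
The projection onto the orthogonal complement of $\mathbf{d}$ mentioned in the Figure 2 caption can be written out as a generic linear-algebra identity. This is a sketch under assumptions: the projector symbol $\mathbf{P}$ is introduced here for illustration, and the column-vector convention (with $\mathbf{d}$ living in the output space of $\mathbf{W}_{\text{out}}$) is assumed rather than taken from the paper, whose exact form may differ.

```latex
\mathbf{P} = \mathbf{I} - \frac{\mathbf{d}\,\mathbf{d}^{\top}}{\lVert \mathbf{d} \rVert^{2}},
\qquad
\mathbf{W}_{\text{out}}^{\text{new}} = \mathbf{P}\,\mathbf{W}_{\text{out}},
\qquad
\mathbf{d}^{\top}\mathbf{W}_{\text{out}}^{\text{new}}
  = \Bigl(\mathbf{d}^{\top} - \frac{\mathbf{d}^{\top}\mathbf{d}}{\lVert \mathbf{d} \rVert^{2}}\,\mathbf{d}^{\top}\Bigr)\mathbf{W}_{\text{out}}
  = \mathbf{0}
```

Because $\mathbf{P}$ is applied once to the weights rather than at every decoding step, a projection of this kind adds no per-token cost at inference time, which is consistent with the $O(1)$ claim in the caption.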

Theorems & Definitions (4)

  • Proposition 4.1: Identifiability of Latent Causal Factors
  • Proposition 4.2: Identifiability via Contrastive Intervention
  • proof
  • proof