Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs
Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong
TL;DR
The paper reframes LLM safety alignment as an unobserved confounder problem and introduces CFA$^2$, a two-stage framework based on Pearl's Front-Door Criterion. By identifying an observable mediator $S$ with Sparse Autoencoders and applying weight orthogonalization to remove the defense subspace, CFA$^2$ achieves a deterministic, $O(1)$-complexity jailbreak that preserves task semantics. Empirically, CFA$^2$ delivers an $83.68\%$ average Attack Success Rate across multiple model families, with strong robustness on highly aligned models and substantially better efficiency and fluency than prior optimization-based methods. The work highlights a mechanistic, causal pathway for jailbreaking, offering insights for both offensive capability and defensive mitigation in LLM safety systems.
Abstract
Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA$^2$), a framework for jailbreaking LLMs that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce the computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA$^2$ achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
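
For reference, the Front-Door Criterion invoked above corresponds to the standard front-door adjustment formula from causal inference. The version below is the textbook form, not the paper's own derivation. The symbols $X$ (input), $Y$ (output), and $U$ (unobserved confounder) are assumptions introduced here for illustration; only the mediator $S$ is named in the summary above, and the paper's notation may differ.

$$
P\big(Y \mid \mathrm{do}(X = x)\big) \;=\; \sum_{s} P(s \mid x) \sum_{x'} P(Y \mid x', s)\, P(x')
$$

The formula holds when $S$ intercepts every directed path from $X$ to $Y$, there is no unblocked back-door path from $X$ to $S$, and every back-door path from $S$ to $Y$ is blocked by $X$. The inner sum over $x'$ is a marginalization over all inputs, which is presumably the expensive step that the abstract describes replacing with a deterministic intervention.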
