Injecting Universal Jailbreak Backdoors into LLMs in Minutes
Zhuowei Chen, Qiannan Zhang, Shichao Pei
TL;DR
This paper examines the risk of jailbreak backdoors in safety-aligned LLMs by introducing JailbreakEdit, a model-editing-based method that implants a universal backdoor in minutes without data poisoning or full fine-tuning. It combines Trigger Representation Extraction and Multi-Node Target Estimation to derive a representation of the backdoor trigger and a robust target vector, so that the model produces jailbreak outputs when the trigger is present and keeps its normal safety behavior otherwise. The key contributions are a closed-form FFN update strategy and a practical attack pipeline that outperforms adapted locate-then-edit baselines in both effectiveness and stealth, as demonstrated across multiple open-source LLMs and toxic datasets. The work has significant implications for defense: it shows that rapid, targeted edits can undermine safety mechanisms, which underscores the need for stronger defenses against post-training backdoor injection.
Abstract
Jailbreak backdoor attacks on LLMs have garnered attention for their effectiveness and stealth. However, existing methods rely on crafting poisoned datasets and on time-consuming fine-tuning. In this work, we propose JailbreakEdit, a novel jailbreak backdoor injection method that exploits model editing techniques to inject a universal jailbreak backdoor into safety-aligned LLMs with minimal intervention in minutes. JailbreakEdit uses multi-node target estimation to estimate the jailbreak space, creating shortcuts from the backdoor to this estimated space that induce jailbreak actions. Our attack effectively shifts the model's attention by attaching strong semantics to the backdoor, enabling it to bypass internal safety mechanisms. Experimental results show that JailbreakEdit achieves a high jailbreak success rate on jailbreak prompts while preserving generation quality and safe behavior on normal queries. Our findings underscore the effectiveness, stealthiness, and explainability of JailbreakEdit, emphasizing the need for more advanced defense mechanisms in LLMs.
