Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, Libo Qin, Xiaoming Shi, Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che
TL;DR
This work introduces ShadowBreak, a stealthy jailbreak framework that uses benign data mirroring to train a local mirror model aligned with a target black-box LLM. The method has two stages: it first builds the mirror model by distilling the target on benign data, then runs an aligned transfer attack, using white-box jailbreak techniques against the mirror to craft prompts that transfer to the target with a reduced risk of detection. The authors formalize the attack objectives, two ASR metrics ($ASR_M$ and $ASR_C$), and stealth measures ($Q$, $Q^!$). They report attack success rates of up to 92% with few detectable queries and show that the mix of alignment data plays a critical role. They also discuss ethical implications and possible defenses, emphasizing the need for robust, adaptive defenses against stealthy transfer attacks in real-world deployments.
Abstract
Large language model (LLM) safety is a critical issue, and numerous studies employ red-team testing to improve model security. Among these, jailbreak methods probe for vulnerabilities by crafting malicious prompts that induce outputs violating the model's safety alignment. Existing black-box jailbreak methods often rely on model feedback, repeatedly submitting queries that contain detectable malicious instructions while searching for an attack. Although these approaches are effective, content moderators may intercept the attack during this search. We propose an improved transfer attack method that guides malicious prompt construction by locally training a mirror model of the target black-box model through benign data distillation. This method offers enhanced stealth because it does not submit identifiable malicious instructions to the target model during the search phase. Against GPT-3.5 Turbo on a subset of AdvBench, our approach achieved a maximum attack success rate of 92%, or a balanced 80% with an average of 1.5 detectable jailbreak queries per sample. These results underscore the need for more robust defense mechanisms.
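The two headline quantities in the abstract, attack success rate and average detectable queries per sample, can be read as simple per-sample aggregates. The sketch below is a minimal illustration of that bookkeeping, not the authors' evaluation code: the `AttackRecord` layout, the function names, and the toy records are all assumptions made here, and the paper's own judging procedure for what counts as a success or a detectable query is not reproduced.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AttackRecord:
    """Per-sample outcome (hypothetical layout, assumed for illustration)."""
    succeeded: bool          # whether the sample was judged a successful jailbreak
    detectable_queries: int  # queries to the target flagged as identifiably malicious


def attack_success_rate(records: Sequence[AttackRecord]) -> float:
    """Fraction of samples judged successful."""
    if not records:
        raise ValueError("records must be non-empty")
    return sum(r.succeeded for r in records) / len(records)


def avg_detectable_queries(records: Sequence[AttackRecord]) -> float:
    """Mean number of detectable queries spent per sample."""
    if not records:
        raise ValueError("records must be non-empty")
    return sum(r.detectable_queries for r in records) / len(records)


# Toy records invented for this example; they are not results from the paper.
toy_records = [
    AttackRecord(succeeded=True, detectable_queries=1),
    AttackRecord(succeeded=True, detectable_queries=2),
    AttackRecord(succeeded=True, detectable_queries=0),
    AttackRecord(succeeded=False, detectable_queries=3),
]

print(attack_success_rate(toy_records))     # 0.75
print(avg_detectable_queries(toy_records))  # 1.5
```

Reporting the two numbers together is what makes the trade-off in the abstract visible: a configuration can raise its success rate by spending more detectable queries, so a "balanced" setting is one that keeps the second number low while the first stays high.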
