Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, Libo Qin, Xiaoming Shi, Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che

TL;DR

This work addresses safety in large language models by introducing ShadowBreak, a stealthy jailbreak framework that uses benign data mirroring to train a local mirror model aligned to a target black-box LLM. The two-stage method first constructs a mirror via benign data distillation and then performs an aligned transfer attack using white-box jailbreak techniques to craft prompts that transfer to the target with reduced risk of detection. The authors formalize attack objectives and dual ASR metrics ($ASR_M$ and $ASR_C$) alongside stealth measures ($Q$, $Q^!$) and demonstrate strong performance (up to 92% ASR) with low detectable-query counts, while highlighting the critical role of alignment data mix. They discuss ethical implications and defensive avenues, emphasizing the need for robust, adaptive defenses to mitigate these stealthy transfer attacks in real-world deployments.

Abstract

Large language model (LLM) safety is a critical issue, with numerous studies employing red team testing to enhance model security. Among these, jailbreak methods explore potential vulnerabilities by crafting malicious prompts that induce model outputs contrary to safety alignments. Existing black-box jailbreak methods often rely on model feedback, repeatedly submitting queries with detectable malicious instructions during the attack search process. Although these approaches are effective, the attacks may be intercepted by content moderators during the search process. We propose an improved transfer attack method that guides malicious prompt construction by locally training a mirror model of the target black-box model through benign data distillation. This method offers enhanced stealth, as it does not involve submitting identifiable malicious instructions to the target model during the search phase. Our approach achieved a maximum attack success rate of 92%, or a balanced value of 80% with an average of 1.5 detectable jailbreak queries per sample against GPT-3.5 Turbo on a subset of AdvBench. These results underscore the need for more robust defense mechanisms.

Paper Structure

This paper contains 49 sections, 6 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Unlike mainstream black-box attack methods that repeatedly probe the target model with malicious instructions, ShadowBreak reduces detection risk by conducting searches on a local mirror model. This mirror model is aligned using benign distillation data from the target model, a process designed to bypass content moderation. The resulting prompts are then transferred to the target model.
  • Figure 2: The ShadowBreak method involves sending benign queries to the target model and using its responses to locally fine-tune a mirror model. This aligned model is then used to generate attack triggers for harmful queries. Finally, these optimized triggers are applied to the target model.
  • Figure 3: This figure illustrates the relationship between alignment data and performance across harmful categories and models for ShadowBreak. The results are based on the StrongReject dataset (Souly et al., 2024) and show performance against GPT-3.5 Turbo. S, DD, NC, V, IG, and HD denote sexual content, disinformation and deception, non-violent crimes, violence, illegal goods and services, and hate and discrimination, respectively.
  • Figure 4: This figure illustrates the relationship between alignment data and performance across harmful categories and models for ShadowBreak. The results are based on the StrongReject dataset (Souly et al., 2024) and show performance against GPT-3.5 Turbo. S1-S6 denote sexual content, disinformation and deception, non-violent crimes, violence, illegal goods and services, and hate and discrimination, respectively.