When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques

Jianing Geng, Biao Yi, Zekun Fei, Tongxi Wu, Lihai Nie, Zheli Liu

TL;DR

The paper examines jailbreak vulnerabilities in safety-aligned LLMs through the lens of stealth and shows that existing methods fail to hide toxicity and preserve natural language at the same time. It presents StegoAttack, a two-stage framework that uses steganography to conceal a harmful query within benign text and prompts the model to encrypt its outputs. The authors formalize a set of steganography principles and show that neutral carrier scenarios improve concealment; a feedback-driven prompt refinement step improves robustness across models. Empirical results on four safety-aligned LLMs and three external detectors show high bypass rates and attack success rates (ASR) while maintaining stealth, exposing gaps in current defenses and informing future defense design.

Abstract

Jailbreak attacks pose a serious threat to large language models (LLMs) by bypassing built-in safety mechanisms and leading to harmful outputs. Studying these attacks is crucial for identifying vulnerabilities and improving model security. This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth. We find that existing attacks struggle to simultaneously achieve toxic stealth (concealing toxic content) and linguistic stealth (maintaining linguistic naturalness). Motivated by this, we propose StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text. The attack then prompts the LLM to extract the hidden query and respond in an encrypted manner. This approach effectively hides malicious intent while preserving naturalness, allowing it to evade both built-in and external safety mechanisms. We evaluate StegoAttack on four safety-aligned LLMs from major providers, benchmarking against eight state-of-the-art methods. StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%. Its ASR drops by less than 1% even under external detection (e.g., Llama Guard). Moreover, it attains the optimal comprehensive scores on stealth detection metrics, demonstrating both high efficacy and exceptional stealth capabilities. The code is available at https://anonymous.4open.science/r/StegoAttack-Jail66
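The ASR figures quoted above can be read with a small sketch in mind. It assumes ASR is the percentage of attempts judged successful, and that under external detection an attempt counts only if it also passes the detector. The function names and the toy outcome lists are hypothetical and are not the paper's data or its judging procedure.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of attempts judged successful (assumed definition of ASR)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def asr_under_detection(outcomes: Sequence[bool], flagged: Sequence[bool]) -> float:
    """ASR when an attempt only counts if an external detector did not flag it."""
    if len(outcomes) != len(flagged):
        raise ValueError("outcomes and flagged must have the same length")
    surviving = [ok and not hit for ok, hit in zip(outcomes, flagged)]
    return attack_success_rate(surviving)


# Toy data (hypothetical): 10 attempts, 9 judged successful, 1 success flagged.
outcomes = [True] * 9 + [False]
flagged = [True] + [False] * 9

base = attack_success_rate(outcomes)
guarded = asr_under_detection(outcomes, flagged)
print(base, guarded, base - guarded)
```

On this reading, the drop under detection is the difference between the two values, measured in percentage points.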

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Harmful queries are blocked by LLMs. AutoDAN improves linguistic stealth, but its malicious intent is detected. Cipher enhances toxic stealth but produces unnatural text. We propose StegoAttack, a fully stealthy method that uses steganography to preserve both toxic and linguistic stealth, evade detection, and achieve a high ASR.
  • Figure 2: Overview of our StegoAttack. Step A: The harmful query is transformed into a jailbreak prompt. In part one, an LLM hides the harmful query using acrostic steganography to generate a benign, natural-looking sentence. In part two, prompt components tailored to specific capabilities are constructed. Step B: Failure cases are analyzed to identify causes, and prompt parameters are refined dynamically based on feedback.
  • Figure 3: Detailed Template of StegoAttack. The hidden sentence generated by steganography is embedded within the second segment of the template.
  • Figure 4: Comparative analysis of four hidden metrics on GPT-o3 versus eight baselines. For clarity, the coordinate distribution is adjusted so that a lower position indicates better performance.
  • Figure 5: Ablation studies of StegoAttack over three parameters. (a) Steganographic Embedding Position. Embedding at the first word yields the highest ASR with minimal iterations. (b) Steganographic Carrier Scenario. Six scenarios are divided into three sentiment orientations (neutral, positive, negative), with the neutral scenarios achieving a higher ASR in fewer iterations. (c) Maximum Iteration Limit. ASR improves as iterations increase, until saturation. To accelerate the experiments, we set the maximum number of iterations to 3 instead of 6 in experiment (b).
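
The first-word embedding position mentioned in the captions for Figures 2 and 5 can be illustrated with a minimal sketch of acrostic extraction, assuming the hidden message is carried by the first word of each sentence. The function name, the sentence-splitting rule, and the benign carrier text are hypothetical. The paper's LLM-based carrier generation and its prompt template are not reproduced here.

```python
import re


def first_word_acrostic(text: str) -> str:
    """Recover a hidden message by taking the first word of each sentence."""
    # Naive sentence split (an assumption): break on whitespace after ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = []
    for sentence in sentences:
        match = re.match(r"[A-Za-z']+", sentence)
        if match:
            words.append(match.group(0).lower())
    return " ".join(words)


# Benign carrier text: each sentence reads naturally on its own.
carrier = (
    "Bring a jacket because the evening gets cold. "
    "Fresh bread is sold at the corner bakery. "
    "Coffee tastes better when shared with friends."
)
print(first_word_acrostic(carrier))  # bring fresh coffee
```

The carrier reads as ordinary prose, which is why a detector that inspects only surface content may not see the embedded words. The same extraction step could in principle be run before a safety classifier to surface them.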