Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent

Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang

TL;DR

The paper investigates prompt-based jailbreaks in large language models and proposes IntentObfuscator, a lightweight framework that obfuscates user intent to bypass content defenses. It introduces two concrete attack modes, Obscure Intention (OI) and Create Ambiguity (CA), and validates them across four commercial models, reporting strong jailbreak success rates and category-wide effectiveness. The work provides a theoretical foundation for prompt obfuscation, presents automated generation pipelines, and discusses mitigation strategies, highlighting implications for red-team testing and model security. Overall, IntentObfuscator demonstrates how obfuscation and ambiguity can facilitate jailbreaks, underscoring the need for robust, scalable defenses in real-world LLM deployments.

Abstract

To demonstrate and address the underlying maliciousness, we propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting this identified flaw by obfuscating the true intentions behind user prompts. This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to evade malicious intent detection effectively. We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which claims 100 million weekly active users, achieved a remarkable success rate of 83.65%. We also extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further proving the substantial impact of our findings on enhancing 'Red Team' strategies against LLM content security frameworks.

Paper Structure

This paper contains 39 sections, 27 equations, 10 figures, 6 tables, and 3 algorithms.

Figures (10)

  • Figure 1: IntentObfuscator Jailbreak attack threat model.
  • Figure 2: The Overview of OI jailbreak.
  • Figure 3: A case of OI jailbreak.
  • Figure 4: Comparison of syntax tree before and after editing.
  • Figure 5: The Overview of CA jailbreak.
  • ...and 5 more figures