WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response

Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lu Lin, Prasenjit Mitra, Jinghui Chen

TL;DR

This work reveals a fundamental vulnerability in safety-alignment pipelines by showing that simultaneous obfuscation of queries and responses can bypass guardrails in leading LLMs. It introduces WordGame and WordGame+, which obfuscate the query via word games and the response via auxiliary tasks and questions, achieving high jailbreak success efficiently. Extensive experiments on six models with AdvBench demonstrate superior effectiveness, especially against Claude 3, GPT-4, and Llama-3, and ablations illustrate the synergistic benefits of combined obfuscation. The findings highlight the need for more robust safety frameworks, and the work proposes red-teaming tools to better evaluate and enhance LLM safety against adaptive obfuscation strategies.

Abstract

The recent breakthrough in large language models (LLMs) such as ChatGPT has revolutionized production processes at an unprecedented pace. Alongside this progress come mounting concerns about LLMs' susceptibility to jailbreaking attacks, which lead to the generation of harmful or unsafe content. Although safety alignment measures have been implemented in LLMs to mitigate existing jailbreak attempts, forcing those attempts to become increasingly complicated, safety alignment is still far from perfect. In this paper, we analyze the common pattern of current safety alignment and show that it is possible to exploit such patterns for jailbreaking attacks by simultaneous obfuscation in queries and responses. Specifically, we propose the WordGame attack, which replaces malicious words with word games to break down the adversarial intent of a query and encourages benign content regarding the games to precede the anticipated harmful content in the response, creating a context that is hardly covered by any corpus used for safety alignment. Extensive experiments demonstrate that the WordGame attack can break the guardrails of the current leading proprietary and open-source LLMs, including the latest Claude-3, GPT-4, and Llama-3 models. Further ablation studies on such simultaneous obfuscation in query and response provide evidence of the merits of the attack strategy beyond an individual attack.

Paper Structure (23 sections, 1 equation, 9 figures, 11 tables, 1 algorithm)

Figures (9)

  • Figure 1: Overview of our proposed WordGame attack.
  • Figure 2: (a) A typical example of existing jailbreaking attacks; (b) an example of query obfuscation in WordGame.
  • Figure 3: Example of a full jailbreaking prompt and the corresponding response by Claude 3, both partitioned according to auxiliary questions, task, and malicious request.
  • Figure 4: Example of WordGame+ successfully jailbreaking Llama 2.
  • Figure 5: Example of WordGame+ successfully jailbreaking Llama 3.
  • ...and 4 more figures