SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters

Yan Yang, Zeguan Xiao, Xin Lu, Hongru Wang, Xuetao Wei, Hailiang Huang, Guanhua Chen, Yun Chen

TL;DR

SeqAR tackles the risk of jailbreaking large language models by automating red-teaming through sequential auto-generated jailbreak characters. It leverages a greedy optimization loop driven by an attacker LLM and a DeBERTaV3-based judgement model to produce universal jailbreak prompts that bypass guardrails across multiple models, including GPT-4, with enhancements from long-tail encoding. The approach achieves high attack success rates across seven prominent LLMs and demonstrates transferability across queries and targets, while enabling combination with other attack methods. These findings reveal both the potential and limitations of automated jailbreaks and motivate robust defense strategies to ensure safer LLM deployment.

Abstract

The widespread application of large language models (LLMs) has raised concerns about their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SeqAR, a simple yet effective framework for designing jailbreak prompts automatically. The SeqAR framework generates and optimizes multiple jailbreak characters and then applies these characters sequentially in a single query to bypass the guardrails of the target LLM. Unlike previous work, which relies on proprietary LLMs or seed jailbreak templates crafted by human experts, SeqAR can generate and optimize the jailbreak prompt in a cold-start scenario using open-source LLMs without any seed jailbreak templates. Experimental results show that SeqAR achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, and we explore defense strategies against the jailbreak attack designed by SeqAR.

Paper Structure

This paper contains 43 sections, 16 figures, and 15 tables.

Figures (16)

  • Figure 1: A simplified example of jailbreaking LLMs with SeqAR.
  • Figure 2: Overview of the SeqAR framework. (Left): The jailbreak attack template of SeqAR. The character descriptions are generated and optimized in a greedy manner. During character optimization, the $r^{\text{th}}$ character is generated and evaluated with the previous $r-1$ characters fixed. The model response starts with these sentences. (Right): The automated generation and optimization process of the $r^{\text{th}}$ jailbreak character. The meta prompt is used to generate character descriptions with an attacker LLM. More details are in Section \ref{sec:character-opt}.
  • Figure 3: ASR results for different numbers of examples in the meta prompt.
  • Figure 4: Visualization of malicious request-level jailbreak attack results. We compare three methods: SeqAR, SeqAR-divided, SeqAR-reversed. A and B are the first and second characters found by SeqAR.
  • Figure 5: Prompt for GPT-based judgement model.
  • ...and 11 more figures