GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

Haibo Jin; Ruoxi Chen; Peiyan Zhang; Andy Zhou; Haohan Wang

TL;DR

GUARD presents a scalable, four-role LLM framework for generating natural-language jailbreak prompts to test adherence to safety guidelines. It leverages a knowledge-graph representation of eight jailbreak characteristics and a Random Walk-based scenario generator, integrated with three workflow blocks to translate guidelines, create playing scenarios, and optimize jailbreak prompts. Extensive experiments across open-source and commercial LLMs, plus vision-language models, show high jailbreak effectiveness and transferability, with ablations confirming the necessity of each role and the KG-based approach. The work advances proactive, cross-modal safety testing for AI systems and informs defenses and governance for safer LLM-powered applications.

Abstract

The discovery of "jailbreaks" that bypass the safety filters of Large Language Models (LLMs) and elicit harmful responses has encouraged the community to implement safety measures. One major safety measure is to proactively test LLMs with jailbreaks prior to release. Such testing requires a method that can generate jailbreaks at scale and efficiently. In this paper, we follow a novel yet intuitive strategy: generating jailbreaks in the style in which humans write them. We propose a role-playing system that assigns four different roles to the user LLMs, which collaborate on new jailbreaks. Furthermore, we collect existing jailbreaks and split them, sentence by sentence, into independent characteristics using clustering frequency and semantic patterns. We organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. Our role-playing system leverages this knowledge graph to generate new jailbreaks, which have proved effective in inducing LLMs to generate unethical or guideline-violating responses. In addition, we pioneer a setting in which the system automatically follows government-issued guidelines to generate jailbreaks that test whether LLMs adhere to those guidelines. We refer to our system as GUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We have empirically validated the effectiveness of GUARD on three cutting-edge open-source LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a widely used commercial LLM (ChatGPT). Moreover, our work extends to vision-language models (MiniGPT-v2 and Gemini Vision Pro), showcasing GUARD's versatility and contributing valuable insights for the development of safer, more reliable LLM-based applications across diverse modalities.
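
The random-walk idea mentioned in the TL;DR can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the node names (`c1` to `c4`), the toy edge list, and the `random_walk` helper are hypothetical placeholders and are not taken from the paper, which describes a knowledge graph over eight jailbreak characteristics without listing them here. The sketch only shows the generic technique of walking a directed graph and recording the visited nodes.

```python
import random

# Hypothetical toy graph: each key is a placeholder characteristic node,
# each value lists the nodes reachable from it. Not the paper's graph.
GRAPH = {
    "c1": ["c2", "c3"],
    "c2": ["c3", "c4"],
    "c3": ["c4"],
    "c4": [],  # sink node: the walk stops here
}


def random_walk(graph, start, max_steps, rng):
    """Walk the graph from `start`, choosing a uniformly random neighbor
    at each step, until `max_steps` is reached or a sink node is hit."""
    path = [start]
    node = start
    for _ in range(max_steps):
        neighbors = graph.get(node, [])
        if not neighbors:
            break
        node = rng.choice(neighbors)
        path.append(node)
    return path


# A seeded generator keeps the walk reproducible across runs.
path = random_walk(GRAPH, "c1", max_steps=10, rng=random.Random(0))
print(path)
```

In a pipeline like the one the abstract describes, each visited node could index a retrieved fragment that a later stage assembles and refines; that step is deliberately left out here, since the source does not specify it.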

Paper Structure (44 sections, 9 figures, 13 tables, 1 algorithm)


Introduction
Related Work
Methodology
Problem Definition
Overview
Guided Question Prompt Generation
Jailbreak Categorization and Scenario Setup
Jailbreak Collection and Categorization
Playing Scenario Generation
Role-playing for Scenario Optimization
Experiments
Experimental Setup
Effectiveness on Jailbreaking LLMs
Direct jailbreaking effectiveness
Transferred jailbreaking effectiveness
...and 29 more sections

Figures (9)

Figure 1: Overall pipeline of GUARD, including generating question prompts, setting playing scenarios, assessing prompts, and improving jailbreak prompts, all achieved by four role-playing LLMs: Translator, Generator, Evaluator, and Optimizer.
Figure 2: Jailbreak success rate with different role-playing models.
Figure 3: Jailbreak results on percentages of pre-collected jailbreaks.
Figure 4: Step 1: guided question prompt generation.
Figure 5: Step 2: guided question prompt generation.
...and 4 more figures
