GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Haohan Wang

TL;DR

GUARD addresses the gap between high-level government guidelines and practical testing by translating guidelines into guideline-violating questions and coupling this with jailbreak diagnostics (GUARD-JD). The method uses adaptive role-playing LLMs (Analyst, Strategic Committee, Question Designer, and Question Reviewer) to produce diverse, guideline-focused prompts, while a knowledge-graph-driven jailbreak factory generates realistic scenarios. Three further roles (Generator, Evaluator, and Optimizer) refine the playing scenarios to expose non-compliance, which is quantified by a semantic similarity metric. Empirically, GUARD demonstrates robust guideline-upholding and jailbreak-diagnostic performance across multiple LLMs, transfers to vision-language models, surpasses several baselines, and receives positive human validation. The work provides a scalable framework for regulatory-aligned safety testing of LLMs, with practical implications for deploying safer AI in real-world applications.

Abstract

As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. GUARD automatically generates guideline-violating questions from government-issued guidelines and tests whether the responses comply with those guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. For responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into diagnostics. This component, named GUARD-JD, creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. The method culminates in a compliance report that delineates the extent of adherence and highlights any violations. We have empirically validated the effectiveness of GUARD on eight LLMs (Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7) by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
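The control flow described in the abstract can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: the names `run_guard`, `compliance_report`, and `Finding`, and the four callables passed in (`generate_questions`, `target_llm`, `violates`, `jailbreak_diagnose`), are all hypothetical stand-ins for the role-playing LLM components the paper describes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    guideline: str
    question: str
    status: str  # "direct_violation", "jailbreak_violation", or "compliant"


def run_guard(
    guidelines: List[str],
    generate_questions: Callable[[str], List[str]],          # hypothetical question generator
    target_llm: Callable[[str], str],                        # model under test
    violates: Callable[[str, str], bool],                    # hypothetical (guideline, response) judge
    jailbreak_diagnose: Callable[[str, str, Callable[[str], str]], bool],  # hypothetical GUARD-JD step
) -> List[Finding]:
    """Assumed flow: ask each question directly, then fall back to jailbreak diagnostics."""
    findings = []
    for guideline in guidelines:
        for question in generate_questions(guideline):
            response = target_llm(question)
            if violates(guideline, response):
                # The direct response already breaks the guideline.
                status = "direct_violation"
            elif jailbreak_diagnose(guideline, question, target_llm):
                # The model refused directly, but a scenario provoked a violation.
                status = "jailbreak_violation"
            else:
                status = "compliant"
            findings.append(Finding(guideline, question, status))
    return findings


def compliance_report(findings: List[Finding]) -> Dict[str, float]:
    """Summarize findings as per-status counts plus an overall violation rate."""
    counts = Counter(f.status for f in findings)
    total = len(findings)
    violations = counts["direct_violation"] + counts["jailbreak_violation"]
    return {
        "direct_violation": counts["direct_violation"],
        "jailbreak_violation": counts["jailbreak_violation"],
        "compliant": counts["compliant"],
        "violation_rate": violations / total if total else 0.0,
    }
```

Passing the components in as plain callables keeps the sketch independent of any particular model API. In the paper, each of these steps is itself carried out by role-playing LLMs rather than by simple functions.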

Paper Structure

This paper contains 45 sections, 14 figures, 14 tables, 1 algorithm.

Figures (14)

  • Figure 1: Examples of GUARD generating questions from high-level guidelines to produce guideline-violating responses and perform jailbreak diagnostics. (a) A human rights rule from the EU's Trustworthy AI Guidelines. (b) Guideline-violating questions generated by GUARD prompt harmful content, revealing non-compliance. (c) For refusal responses, jailbreak diagnostics uncover scenarios where LLMs fail to adhere to guidelines.
  • Figure 2: Overall pipeline of GUARD, comprising the generation of guideline-violating questions (shown in the grey block) and jailbreak diagnostics (shown in the remaining block). Both are achieved by adaptive role-playing LLMs.
  • Figure 3: Human validation on guideline
  • Figure 4: Human validation on semantic similarity and harmfulness.
  • Figure 5: Step 1: Identifying and organizing principles and conflicts from a rule.
  • ...and 9 more figures