CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

TL;DR

CoP presents an agentic, principle-based red-teaming framework that composes multiple human-provided jailbreak principles to autonomously craft effective jailbreak prompts. Through a dual-judge evaluation and iterative refinement, CoP achieves state-of-the-art single-turn attack performance and substantial query-efficiency gains across open-source and commercial LLMs, including highly aligned models. The results reveal systemic safety weaknesses and offer a transparent, extensible tool for proactive safety testing, while highlighting avenues for defense augmentation and broader applicability beyond jailbreak testing. Overall, CoP advances automated, scalable, and interpretable red-teaming methodologies for robust LLM safety evaluation.

Abstract

Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into producing harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red-teaming process of LLMs through the Composition-of-Principles (CoP) framework, where human users provide a set of red-teaming principles as instructions to an AI agent, which automatically orchestrates effective red-teaming strategies and generates jailbreak prompts. Distinct from existing red-teaming methods, CoP offers a unified and extensible way to encompass and orchestrate human-provided red-teaming principles, enabling the automated discovery of new red-teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.

Paper Structure

This paper contains 33 sections, 9 figures, 15 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overall system illustration of the Composition-of-Principles (CoP) agentic red-teaming pipeline, which consists of three integral components. Part (a) shows the overall pipeline of CoP. The original query is forwarded to a Red-teaming Agent, an LLM-based agent that, based on human-provided jailbreak principles, automatically orchestrates different compositions of principles to generate jailbreak prompts and elicit undesired behaviors from the Target LLM. Subsequently, the Judge LLM evaluates the Target LLM's response on a quantitative scale of 1 to 10 to determine the efficacy of the jailbreak attempt. Concurrently, a similarity assessment is conducted between the jailbreak prompt and the original query to ensure that the intended objective is preserved. If the jailbreak attempt is unsuccessful, the system initiates a feedback loop to the Red-teaming Agent for improved jailbreak prompt generation. Part (b) shows the deployment of the Red-teaming Agent, which first transforms the original harmful query into the initial jailbreak prompt $P_\text{init}$ using the initial seed prompt generator, then processes both $P_\text{init}$ and the set of human-provided principles. Leveraging its comprehensive knowledge base, the Red-teaming Agent strategically synthesizes various principles to construct an optimized jailbreak prompt. Part (c) illustrates the iterative functionality of the CoP pipeline through a case study.
  • Figure 2: Key results: performance comparison between CoP and state-of-the-art single-turn jailbreak attacks, showing the advantage of CoP.
  • Figure 3: Attack Success Rate (ASR) comparisons among different jailbreak attack methods and target models, evaluated on 400 HarmBench queries with the HarmBench classifier. (a) Open-source LLMs: Llama and Gemma models. (b) Closed-source LLMs: Gemini Pro 1.5, GPT-4-1106-Preview, O1, and Claude-3.5-Sonnet. Overall, CoP consistently outperforms all baselines.
  • Figure 4: (a) Distribution of successful CoP jailbreak strategies (compositions of principles), counted across 6 different LLMs. (b) Top-3 distribution of successful CoP jailbreak strategies on OpenAI O1. (c) Top-3 distribution of successful CoP jailbreak strategies on Claude-3.5-Sonnet.
  • Figure 5: CoP performance on the safety-enhanced model Llama-3-8B-Instruct-RR. The pie chart shows that CoP outperforms all baseline jailbreak methods.
  • ...and 4 more figures
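
The control flow described in the Figure 1 caption (seed prompt, principle composition, target query, judge score, similarity check, feedback loop) can be sketched as a small evaluation harness. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `cop_loop`, the `Attempt` record, the callable signatures, the iteration budget, and both thresholds are hypothetical, and every model call is left as a caller-supplied stub.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Attempt:
    prompt: str
    response: str
    judge_score: int    # 1 to 10 scale, as in the Figure 1 caption
    similarity: float   # assumed 0 to 1 scale for objective preservation


def cop_loop(
    query: str,
    principles: Sequence[str],
    seed_generator: Callable[[str], str],
    compose: Callable[[str, Sequence[str], Optional[Attempt]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], int],
    similarity: Callable[[str, str], float],
    max_iters: int = 5,            # assumed budget, not from the paper
    success_score: int = 10,       # assumed success threshold
    min_similarity: float = 0.5,   # assumed similarity threshold
) -> Optional[Attempt]:
    """Return the first attempt passing both checks, or None if the budget runs out."""
    prompt = seed_generator(query)  # corresponds to P_init
    feedback: Optional[Attempt] = None
    for _ in range(max_iters):
        # The agent composes principles, optionally using the last failed attempt.
        prompt = compose(prompt, principles, feedback)
        response = target(prompt)
        attempt = Attempt(
            prompt=prompt,
            response=response,
            judge_score=judge(query, response),
            similarity=similarity(query, prompt),
        )
        # Dual check: judge score and similarity to the original query.
        if attempt.judge_score >= success_score and attempt.similarity >= min_similarity:
            return attempt
        feedback = attempt  # feedback loop to the agent
    return None
```

Passing the agent, target, judge, and similarity scorer as plain callables keeps the loop independent of any particular model API, so each component can be swapped or mocked when testing the harness itself.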