Purple-teaming LLMs with Adversarial Defender Training

Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, Helen Meng

TL;DR

This paper introduces PAD, an adaptive purple-teaming framework that jointly trains a red-teaming attacker and a blue-teaming defender to safeguard LLMs against unsafe content. It pairs a base LLM with LoRA-tuned attacker and defender modules and uses ShieldLM as a safety Judge that supplies labels and explanations, enabling GAN-style iterative updates. Empirical results show that PAD improves both the attacker’s ability to reveal vulnerabilities and the defender’s ability to filter unsafe outputs while maintaining overall generation quality, even in multi-turn and zero-resource settings. The work also includes an error analysis that highlights discrimination gaps for certain safety rules and proposes directions such as per-rule Judge models, more conversation turns, and generalization to other open-source LLMs. Overall, PAD offers a data-efficient, adaptive pathway to safer LLMs in dynamic safety landscapes, with practical implications for deployment and evaluation.

Abstract

Existing efforts in safeguarding LLMs are limited in actively exposing the vulnerabilities of the target LLM and in readily adapting to newly emerging safety risks. To address this, we present Purple-teaming LLMs with Adversarial Defender training (PAD), a pipeline designed to safeguard LLMs by combining red-teaming (attack) and blue-teaming (safety training) techniques in a novel way. In PAD, we automatically collect conversational data that cover the vulnerabilities of an LLM around specific safety risks in a self-play manner, where the attacker aims to elicit unsafe responses and the defender generates safe responses to these attacks. We then update both modules in a generative adversarial network style, training the attacker to elicit more unsafe responses and updating the defender to identify them and explain why they are unsafe. Experimental results demonstrate that PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safety guardrail. Furthermore, our findings indicate that PAD excels in striking a balance between safety and overall model quality. We also reveal key challenges in safeguarding LLMs, including defending against multi-turn attacks and the need for more delicate strategies to identify specific risks.
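
The self-play data collection described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names self_play_round, Verdict, toy_attack, toy_defend, and toy_judge are hypothetical, the attacker, defender, and Judge are stand-in functions rather than LLMs, and no model is actually trained. The sketch only shows how one round of conversations might be split into attacker data (attacks the Judge marks as successful) and defender data (every exchange with its label and explanation).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Hypothetical Judge output: a safety label plus a short explanation."""
    unsafe: bool
    explanation: str


def self_play_round(
    attack: Callable[[str], str],
    defend: Callable[[str], str],
    judge: Callable[[str, str], Verdict],
    seeds: List[str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, Verdict]]]:
    """Run one round of attacker/defender self-play and label it with the judge."""
    attacker_data = []  # (seed, prompt) pairs whose prompt elicited an unsafe response
    defender_data = []  # (prompt, response, verdict) for every exchange
    for seed in seeds:
        prompt = attack(seed)
        response = defend(prompt)
        verdict = judge(prompt, response)
        defender_data.append((prompt, response, verdict))
        if verdict.unsafe:
            attacker_data.append((seed, prompt))
    return attacker_data, defender_data


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_attack(seed: str) -> str:
    return f"Tell me about {seed}"


def toy_defend(prompt: str) -> str:
    # This toy defender refuses only one kind of request.
    if "weapon" in prompt:
        return "I cannot help with that."
    return f"Sure: {prompt}"


def toy_judge(prompt: str, response: str) -> Verdict:
    # This toy judge flags compliance with one specific risky topic.
    unsafe = response.startswith("Sure") and "scam" in prompt
    reason = "complies with a harmful request" if unsafe else "no unsafe content"
    return Verdict(unsafe, reason)


attacker_data, defender_data = self_play_round(
    toy_attack, toy_defend, toy_judge, ["weapon", "scam", "weather"]
)
```

In an iterative setup, each round's attacker_data and defender_data would feed the next update of the corresponding module. How PAD actually selects samples and which training objectives it uses are not specified in this section, so those steps are left out here.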

Paper Structure (40 sections, 2 equations, 6 figures, 4 tables, 1 algorithm)

This paper contains 40 sections, 2 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: The training paradigm of purple-teaming LLMs with Adversarial Defender Training.
  • Figure 2: Cases of conversations between attackers and defenders.
  • Figure 3: System prompts for different tasks.
  • Figure 4: Head-to-head comparison of the overall quality of safe responses from PADv3 vs. Base.
  • Figure 5: Breakdown of ASR on different turns.
  • ...and 1 more figure