DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints

Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-jin Liu, Zilong Zheng, Gao Huang

TL;DR

Diversity is crucial for robust LLM safety evaluation, but traditional automatic red teaming methods overemphasize attack success rate, risking narrow prompt coverage and overfitting to safety proxies. The authors present DiveR-CT, which reframes red teaming as constrained policy optimization and adds a dynamic, nearest-neighbor semantic diversity reward that keeps steering the policy toward uncovered regions of the semantic space as the generation history grows. Across extensive experiments, DiveR-CT achieves higher lexical and semantic diversity at controlled attack success rates, produces data that makes blue-team safety tuning more effective, and remains robust to changes in the unsafe-reward classifier and blue-model strength. This yields a scalable framework for red teaming LLMs that can accelerate safe deployment in real-world settings.
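
To make the nearest-neighbor idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes sentence embeddings are plain lists of floats and compares two hypothetical reward functions: `mean_history_reward`, which averages cosine similarity over the entire history, and `knn_reward`, which averages only over the `k` most similar past embeddings. The function names, the value of `k`, and the toy 2-D embeddings are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_history_reward(embedding, history):
    """Novelty as the negative mean cosine similarity to ALL past embeddings."""
    if not history:
        return 0.0
    total = sum(cosine_similarity(embedding, h) for h in history)
    return -total / len(history)


def knn_reward(embedding, history, k=3):
    """Novelty as the negative mean cosine similarity to the k NEAREST past embeddings."""
    if not history:
        return 0.0
    sims = sorted((cosine_similarity(embedding, h) for h in history), reverse=True)
    nearest = sims[:k]  # the k most similar past generations
    return -sum(nearest) / len(nearest)


# Toy history spread evenly around the unit circle (illustrative data only).
history = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
repeat = [1.0, 0.0]  # exact repeat of a past generation
novel = [0.6, 0.8]   # falls between two past generations

mean_repeat = mean_history_reward(repeat, history)
knn_repeat = knn_reward(repeat, history, k=1)
knn_novel = knn_reward(novel, history, k=1)
```

In this toy setup the full-history average cannot tell the repeat from the novel point, because similarities to distant history entries cancel out, while the nearest-neighbor variant penalizes the repeat more heavily than the novel point.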

Abstract

Recent advances in large language model assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that decrease the cosine similarity from historical embeddings with semantic diversity rewards lead to novelty stagnation as history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better on various diversity metrics across different attack success rate levels, 2) better enhancing the resilience of blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment.
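
The "dynamic control of objective weights" can be illustrated with a generic Lagrangian relaxation, shown here as a sketch under stated assumptions rather than the paper's exact update rule. It assumes the constraint takes the form "mean unsafe score at least `target_unsafe`" and that a single multiplier `lam` is adjusted by dual ascent. The names `update_multiplier`, `shaped_reward`, `target_unsafe`, and the learning rate are hypothetical.

```python
def update_multiplier(lam, mean_unsafe, target_unsafe, lr=0.5):
    """Dual ascent step: grow lam while the batch misses the unsafe-score target,
    shrink it (never below zero) once the target is exceeded."""
    violation = target_unsafe - mean_unsafe
    return max(0.0, lam + lr * violation)


def shaped_reward(diversity, unsafe, lam, target_unsafe):
    """Relaxed objective: diversity reward plus a lam-weighted constraint term."""
    return diversity + lam * (unsafe - target_unsafe)


# Illustrative batch-level mean unsafe scores over four training steps.
target_unsafe = 0.5
batch_means = [0.1, 0.3, 0.7, 0.9]

lam = 0.0
trace = []
for mean_unsafe in batch_means:
    lam = update_multiplier(lam, mean_unsafe, target_unsafe)
    trace.append(lam)
```

Under these assumptions the weight on the attack objective rises while attacks fall short of the target and decays once they exceed it, which leaves the policy free to spend its remaining capacity on diversity instead of maximizing attack success without bound.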

Paper Structure (60 sections, 31 equations, 14 figures, 12 tables)

Figures (14)

  • Figure 1: Main Framework of DiveR-CT. The key components of DiveR-CT are: ⓐ casting automatic red teaming as a constrained policy optimization problem, which gives our policies greater flexibility by relaxing the maximization objective; and ⓑ the revamped dynamic semantic reward. For a generation at time $t+1$ that is close to the previous one, CRT (Hong et al., 2024) assigns a high reward, while DiveR-CT assigns a low k-NN reward, encouraging the policy to discover novel generations.
  • Figure 2: Comparison of Embeddings using PCA: Per-step Mean and Cumulative Mean of Embeddings. This figure highlights the evolution of generations in the embedding space by showing the cumulative average (gradient line) and the per-step average (scatter points) of the embeddings. DiveR-CT demonstrates more uniform coverage of attacks.
  • Figure 3: Overoptimization Testing with Test Safety Classifier. We evaluate the extent of overoptimization by employing a test safety classifier, DaNLP/da-electra-hatespeech-detection. Our method achieves a greater reduction in overoptimization.
  • Figure 4: Red Team Generation Quality Assessment Through Safety Tuning. We finetune the blue team model using a mix of successful red team queries and the Alpaca dataset. This figure illustrates the robustness of response rate and OpenLLM Accuracy, demonstrating that safety tuning with DiveR-CT generated data better enhances LLM safety.
  • Figure 5: Diversity Metrics of Red Team Generations against More Capable Blue Team Models. We present the ASR and diversity metrics of red teaming queries when the blue team is changed to more capable chat models: Meta-Llama-3-8B-Instruct (top) and Llama-2-7b-chat-hf (bottom). As attack difficulty increases, CRT's ASR drops dramatically under its default safety coefficient. Despite having higher ASR than CRT, DiveR-CT also outperforms it on diversity metrics in both settings.
  • ...and 9 more figures