DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints
Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-jin Liu, Zilong Zheng, Gao Huang
TL;DR
Diversity is crucial for robust LLM safety evaluation, but traditional automatic red-teaming methods overemphasize attack success rate, which risks narrow prompt coverage and overfitting to safety proxies. The authors present DiveR-CT, which reframes red teaming as constrained policy optimization and adds a dynamic, nearest-neighbor semantic diversity reward so that the policy keeps covering new regions of the semantic space as the history of generated prompts grows. In the authors' experiments, DiveR-CT achieves higher lexical and semantic diversity at controlled attack success rates, produces data that makes blue-team models more resilient after safety tuning, and remains robust to changes in the unsafe-reward classifier and blue-model strength. The result is a scalable framework for red-teaming LLMs that the authors argue can accelerate safe deployment in real-world settings.
Abstract
Recent advances in large language model assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that use semantic diversity rewards to decrease cosine similarity to historical embeddings suffer from novelty stagnation as the history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better on various diversity metrics across different attack success rate levels, 2) better enhancing the resiliency of blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment.
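
The contrast between a reward computed against the whole history and a nearest-neighbor reward can be illustrated with a small sketch. This is not the authors' implementation: the summary above does not give the exact formula, so the function names (`mean_history_reward`, `knn_reward`), the choice of k, and the toy 2-D embeddings are all assumptions made for illustration. The sketch only shows why averaging cosine similarity over a growing history can make a repeated prompt look novel, while restricting the average to the k most similar past embeddings does not.

```python
import numpy as np


def _unit(x):
    """L2-normalize along the last axis so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def mean_history_reward(query, history):
    """Hypothetical baseline: negative mean cosine similarity to ALL past embeddings."""
    sims = _unit(history) @ _unit(query)
    return float(-sims.mean())


def knn_reward(query, history, k=2):
    """Hypothetical nearest-neighbor reward: negative mean cosine similarity
    to the k most similar past embeddings only."""
    sims = _unit(history) @ _unit(query)
    k = min(k, sims.shape[0])
    top_k = np.sort(sims)[-k:]  # the k largest similarities
    return float(-top_k.mean())


# Toy history (assumed data): 98 embeddings in one region, 2 in another.
history = np.array([[1.0, 0.0]] * 98 + [[0.0, 1.0]] * 2)

dup = np.array([0.0, 1.0])   # exact repeat of a rarely visited region
new = np.array([-1.0, 0.0])  # a direction not present in the history

# Averaged over the whole history, the repeat is barely penalized,
# because the 98 unrelated embeddings dominate the mean.
print(mean_history_reward(dup, history))

# The nearest-neighbor reward penalizes the repeat fully and ranks the
# unseen direction above it.
print(knn_reward(dup, history, k=2), knn_reward(new, history, k=2))
```

In this toy setting the full-history average treats an exact duplicate as almost novel, which is one way to read the "novelty stagnation" described in the abstract; the nearest-neighbor variant keeps the signal local regardless of how large the history becomes.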
