SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, Weiran Xu
TL;DR
SEAS addresses evolving safety vulnerabilities in large language models by coupling a dedicated SEAS dataset with a self-evolving adversarial pipeline that alternates between red-team prompt generation and target-model safety updates. The method comprises three iterative stages (Initialization, Attack, and Adversarial Optimization), in which the Red Team and Target models are repeatedly refined through Direct Preference Optimization-guided updates, with a Safe Classifier serving as the assessor. Empirically, SEAS yields substantial safety improvements after three iterations: the Target model's safety approaches GPT-4 level on SEAS-Test while its general capabilities are preserved, and the Red Team model's attack success rate against competitive models increases notably. The work supplies a new dataset, an open-source pipeline, and evidence that iterative, data-driven adversarial training can enhance LLM safety more efficiently than static red-teaming, with practical implications for safer deployment and ongoing risk assessment.
Abstract
As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to specifically target and explore the weaknesses of these models. To tackle these challenges, we introduce the $\mathbf{S}\text{elf-}\mathbf{E}\text{volving }\mathbf{A}\text{dversarial }\mathbf{S}\text{afety }\mathbf{(SEAS)}$ optimization framework, which enhances security by leveraging data generated by the model itself. SEAS operates through three iterative stages: Initialization, Attack, and Adversarial Optimization, refining both the Red Team and Target models to improve robustness and safety. This framework reduces reliance on manual testing and significantly enhances the security capabilities of LLMs. Our contributions include a novel adversarial framework and a comprehensive safety dataset. After three iterations, the Target model achieves a security level comparable to GPT-4, while the Red Team model shows a marked increase in attack success rate (ASR) against advanced models. Our code and datasets are released at https://SEAS-LLM.github.io/.
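To make the Attack stage and the ASR metric concrete, the following is a minimal sketch, not the released SEAS pipeline. It assumes ASR is the fraction of adversarial prompts whose responses a safety classifier labels unsafe. All names (`Attempt`, `run_attack`, `attack_success_rate`, `split_by_outcome`) and the toy target and classifier are hypothetical stand-ins. How SEAS actually builds its preference data for the Direct Preference Optimization updates is not specified here, so the split below is only one plausible starting point.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    prompt: str     # adversarial prompt produced by the red-team model
    response: str   # target model's reply
    unsafe: bool    # label assigned by the safety classifier


def run_attack(
    prompts: List[str],
    target: Callable[[str], str],
    is_unsafe: Callable[[str, str], bool],
) -> List[Attempt]:
    """Query the target with each prompt and label every response."""
    attempts = []
    for prompt in prompts:
        response = target(prompt)
        attempts.append(Attempt(prompt, response, is_unsafe(prompt, response)))
    return attempts


def attack_success_rate(attempts: List[Attempt]) -> float:
    """Fraction of attempts labelled unsafe (assumed ASR definition)."""
    if not attempts:
        return 0.0
    return sum(a.unsafe for a in attempts) / len(attempts)


def split_by_outcome(attempts: List[Attempt]) -> Tuple[List[str], List[str]]:
    """Separate prompts that elicited unsafe output from those that did not."""
    successful = [a.prompt for a in attempts if a.unsafe]
    blocked = [a.prompt for a in attempts if not a.unsafe]
    return successful, blocked


# Toy stand-ins (hypothetical): the target refuses prompts it has already
# learned to recognise, and the classifier flags any non-refusal as unsafe.
def toy_target(prompt: str) -> str:
    return "REFUSE" if prompt.startswith("known") else "COMPLY"


def toy_classifier(prompt: str, response: str) -> bool:
    return response == "COMPLY"


prompts = ["known pattern 1", "novel pattern 1", "known pattern 2", "novel pattern 2"]
attempts = run_attack(prompts, toy_target, toy_classifier)
asr = attack_success_rate(attempts)
successful, blocked = split_by_outcome(attempts)
print(f"ASR = {asr:.2f}; successful = {successful}; blocked = {blocked}")
```

In an iterative setup like the one described above, the successful and blocked prompts from one round would feed the next round's updates to both models, and a falling ASR on a held-out test set would indicate that the Target model is becoming safer.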
