SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models

Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, Weiran Xu

TL;DR

SEAS addresses evolving safety vulnerabilities in large language models by coupling a dedicated SEAS dataset with a self-evolving adversarial pipeline that alternates between red-teaming attacks and target-model safety updates. The method comprises three iterative stages (Initialization, Attack, and Adversarial Optimization), in which the Red Team and Target models are repeatedly refined through Direct Preference Optimization-guided updates, with responses assessed by a Safe Classifier. Empirically, after three iterations the Target model's safety approaches GPT-4 level on SEAS-Test while its general capabilities are preserved, and the Red Team model's attack success rate against competitive models increases notably. The work supplies a new dataset, an open-source pipeline, and evidence that iterative, data-driven adversarial training can enhance LLM safety more efficiently than static red-teaming, with practical implications for safer deployment and ongoing risk assessment.

Abstract

As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to specifically target and explore the weaknesses of these models. To tackle these challenges, we introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which enhances security by leveraging data generated by the model itself. SEAS operates through three iterative stages: Initialization, Attack, and Adversarial Optimization, refining both the Red Team and Target models to improve robustness and safety. This framework reduces reliance on manual testing and significantly enhances the security capabilities of LLMs. Our contributions include a novel adversarial framework and a comprehensive safety dataset. After three iterations, the Target model achieves a security level comparable to GPT-4, while the Red Team model shows a marked increase in attack success rate (ASR) against advanced models. Our code and datasets are released at https://SEAS-LLM.github.io/.
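
The alternation between the Attack and Adversarial Optimization stages can be pictured with a minimal sketch. It is not the released SEAS code: the callable types, `attack_stage`, `run_seas`, and the `update` hook are hypothetical names chosen for illustration. It also assumes that ASR is the fraction of attacks whose response the classifier labels unsafe (label 1), and it treats the Initialization stage as already done, so the models are passed in ready to use.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces; none of these names come from the SEAS codebase.
RedTeam = Callable[[str], str]           # seed prompt -> adversarial prompt
Target = Callable[[str], str]            # prompt -> response
Classifier = Callable[[str, str], int]   # (prompt, response) -> 1 if unsafe, else 0
Record = Tuple[str, str, str, int]       # (seed, prompt, response, label)
Update = Callable[[RedTeam, Target, List[Record]], Tuple[RedTeam, Target]]


def attack_stage(red_team: RedTeam, target: Target,
                 classifier: Classifier, seeds: List[str]) -> List[Record]:
    """Run one Attack stage and return labelled attack records."""
    records = []
    for seed in seeds:
        prompt = red_team(seed)
        response = target(prompt)
        records.append((seed, prompt, response, classifier(prompt, response)))
    return records


def attack_success_rate(records: List[Record]) -> float:
    """Share of attacks whose response was labelled unsafe (label == 1)."""
    if not records:
        return 0.0
    return sum(label for _, _, _, label in records) / len(records)


def run_seas(red_team: RedTeam, target: Target, classifier: Classifier,
             seeds: List[str], update: Update, iterations: int = 3) -> List[float]:
    """Alternate Attack and Adversarial Optimization; return the ASR per iteration."""
    history = []
    for _ in range(iterations):
        records = attack_stage(red_team, target, classifier, seeds)
        history.append(attack_success_rate(records))
        # Adversarial Optimization: both models are refined from the same records.
        red_team, target = update(red_team, target, records)
    return history
```

In a real setting `update` would wrap a preference-optimization training step for each model; here it is left as a caller-supplied function so the loop structure stays visible.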

Paper Structure (68 sections, 4 equations, 5 figures, 13 tables)


Figures (5)

  • Figure 1: SEAS pipeline. Initialization Stage: the Red Team model $R_0$ and the Target model $T_0$ are fine-tuned on different datasets. Attack Stage: in the $(i+1)$th iteration, $R_i$ is activated with seed prompts to generate adversarial prompts that attack $T_i$; the responses are then evaluated by the Safe Classifier, where label = 1 denotes an unsafe response. Adversarial Optimization Stage: both models are optimized with a pair-wise loss, using data selected according to the evaluation.
  • Figure 2: Examples of Risk Categories and Attack Styles, with sensitive terms masked. A harmless example from the harmless test set shares the same language style as the Adversarial Prefix. For more examples, see Appendix A.
  • Figure 3: Performance of Target models on the XSTest evaluations with Safe and Unsafe Prompts. The lower the Full Refusal and Partial Refusal rates, the better.
  • Figure 4: Incorrect Refusal on the harmless test set. We use the same evaluation criteria as XSTest. "It" stands for "Instruct".
  • Figure 5: Examples of SEAS harmless test set.
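
The Figure 1 caption says that the Adversarial Optimization Stage selects data for a pair-wise loss based on the Safe Classifier labels. One plausible reading is sketched below; the selection rule is an assumption, not something the caption spells out. The sketch assumes several attacks are sampled per seed and several responses per adversarial prompt. For the Red Team model, a prompt that elicited an unsafe response is "chosen" over one that did not; for the Target model, a safe response is "chosen" over an unsafe one. `AttackRecord`, `red_team_pairs`, and `target_pairs` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class AttackRecord(NamedTuple):
    seed: str
    prompt: str
    response: str
    label: int  # 1 = unsafe response, following the Figure 1 convention


def red_team_pairs(records: List[AttackRecord]) -> List[Dict[str, str]]:
    """Per seed, pair a successful attack prompt (chosen) with a failed one (rejected)."""
    by_seed = defaultdict(lambda: {0: [], 1: []})
    for r in records:
        by_seed[r.seed][r.label].append(r.prompt)
    pairs = []
    for seed, groups in by_seed.items():
        # zip stops at the shorter list, so seeds with only one outcome yield no pair.
        for chosen, rejected in zip(groups[1], groups[0]):
            pairs.append({"prompt": seed, "chosen": chosen, "rejected": rejected})
    return pairs


def target_pairs(records: List[AttackRecord]) -> List[Dict[str, str]]:
    """Per adversarial prompt, pair a safe response (chosen) with an unsafe one (rejected)."""
    by_prompt = defaultdict(lambda: {0: [], 1: []})
    for r in records:
        by_prompt[r.prompt][r.label].append(r.response)
    pairs = []
    for prompt, groups in by_prompt.items():
        for chosen, rejected in zip(groups[0], groups[1]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The `prompt`/`chosen`/`rejected` keys mirror a common layout for preference datasets, which keeps the output easy to hand to a pair-wise trainer; the actual SEAS data format may differ.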