Lifelong Safety Alignment for Language Models

Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang

TL;DR

This work tackles the challenge of maintaining safety alignment for LLMs in the face of unseen, evolving jailbreaking attempts. It introduces a Lifelong Safety Alignment Framework built around a competitive two-player setup: a Meta-Attacker that discovers novel jailbreak strategies and a Defender that learns to resist them, with a warm-up phase that leverages GPT-4o to mine strategies from jailbreak literature. Through iterative Adversarial-Play Evolution, the Defender becomes progressively more robust (e.g., achieving 0% ASR on seen tasks after two iterations) while the Meta-Attacker uncovers increasingly sophisticated attacks, including multi-turn–like strategies. The approach demonstrates strong robustness improvements, transferability, and maintained helpfulness, and code is released to enable replication and further research in safer open-ended deployment of LLMs.

Abstract

LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first-iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Evolution of successful jailbreak strategies across iterations in our lifelong safety alignment framework. Left: Strategies that succeed against the initial model $\bm{M_0}$. Right: Strategies that succeed against the updated Defender $\bm{M_1}$. Notably, the dominant strategy category "Fictional Scenarios & Role-Playing" drops from the majority to under 5% in the second iteration, indicating that $\bm{M_1}$ effectively defends against these attacks through adversarial-play evolution.
  • Figure 2: Lifelong safety alignment framework. In the Warm-Up Stage (Step 1), a powerful LLM $\bm{M_{api}}$ (e.g., GPT-4o) is used to analyze jailbreak-related research papers and open-source code. Key strategies $\bm{s}$ are extracted and used by the initial Meta-Attacker $\bm{A_0}$ to generate jailbreak questions $\bm{x}$ targeting specific goals $\bm{g}$. These are submitted to the target model $\bm{M_0}$, producing responses $\bm{y}$, and forming tuples $\bm{(s, x, y, g)}$ that are categorized into the success buffer $\bm{B_s}$ or the failure buffer $\bm{B_f}$. In the Lifelong Safety Alignment Stage (Steps 2–4), the Meta-Attacker and Defender co-evolve through iterative interaction. The Meta-Attacker learns from failed cases in $\bm{B_f}$ and generates new attack strategies and questions, which are again evaluated against $\bm{M_0}$. A safeguard model $\bm{M_j}$ assesses the responses and updates the buffers $\bm{B_s}$ and $\bm{B_f}$ accordingly. Successful tuples in $\bm{B_s}$ are used to further evolve the Meta-Attacker via beam search [beamsearch] and Reject Fine-Tuning [dong2023raft, yuan2023scaling], forming an iterative Adversarial-Play Evolution Loop. This loop continues until one of two conditions is met: (1) the goal success rate exceeds a threshold $\bm{K}$, or (2) the maximum number of iterations $\bm{N}$ is reached. At the end of each loop, the Defender $\bm{M_0}$ is updated through refusal training using successful attack cases in $\bm{B_s}$ and refusal outputs from the refusal model $\bm{M_r}$.
  • Figure 3: The extracted strategy from CodeAttack [ren2024codeattack] by the API model GPT-4o.
  • Figure 4: The extracted strategy from Random Augment Attack [randomaugment] by the API model GPT-4o.
  • Figure 5: The extracted strategy from Past Tense Attack [pasttense] by the API model GPT-4o.
  • ...and 1 more figure
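
The control flow described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Attempt`, `Buffers`, `adversarial_play_loop`, `build_refusal_dataset`, and the `toy_*` stubs) is hypothetical, the models are stand-in callables rather than real LLMs, and the beam search and Reject Fine-Tuning steps that evolve the Meta-Attacker are omitted. The sketch also assumes that a goal is dropped from the loop once it succeeds and that reaching the threshold $\bm{K}$ counts as meeting the stopping condition.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Attempt:
    strategy: str  # s: strategy used by the attacker
    question: str  # x: question sent to the target model
    response: str  # y: target model's response
    goal: str      # g: goal the question was written for


@dataclass
class Buffers:
    success: List[Attempt] = field(default_factory=list)  # B_s
    failure: List[Attempt] = field(default_factory=list)  # B_f


def adversarial_play_loop(
    goals: Sequence[str],
    attacker: Callable[[str, List[Attempt]], Tuple[str, str]],
    target: Callable[[str], str],
    judge: Callable[[str, str], bool],
    threshold_k: float,
    max_iters_n: int,
) -> Buffers:
    """Run attacker/target rounds until the goal success rate reaches
    threshold_k or max_iters_n rounds have been played."""
    buffers = Buffers()
    if not goals:
        return buffers
    solved = set()
    for _ in range(max_iters_n):
        for goal in goals:
            if goal in solved:  # assumption: solved goals are not retried
                continue
            # The attacker conditions on its earlier failures for this goal.
            past_failures = [a for a in buffers.failure if a.goal == goal]
            strategy, question = attacker(goal, past_failures)
            response = target(question)
            attempt = Attempt(strategy, question, response, goal)
            if judge(goal, response):  # stand-in for the safeguard model M_j
                buffers.success.append(attempt)
                solved.add(goal)
            else:
                buffers.failure.append(attempt)
        # Assumption: ">=" is used, so reaching K also stops the loop.
        if len(solved) / len(goals) >= threshold_k:
            break
    return buffers


def build_refusal_dataset(
    buffers: Buffers, refuser: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Pair each successful question in B_s with a refusal from M_r."""
    return [(a.question, refuser(a.question)) for a in buffers.success]


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_attacker(goal: str, past_failures: List[Attempt]) -> Tuple[str, str]:
    strategy = "strategy-B" if past_failures else "strategy-A"
    return strategy, f"[{strategy}] {goal}"


def toy_target(question: str) -> str:
    return "COMPLY" if "strategy-B" in question else "REFUSE"


def toy_judge(goal: str, response: str) -> bool:
    return response == "COMPLY"


def toy_refuser(question: str) -> str:
    return "REFUSE"
```

With the toy stubs, calling `adversarial_play_loop(["goal-1", "goal-2"], toy_attacker, toy_target, toy_judge, threshold_k=1.0, max_iters_n=5)` plays two rounds: the first fills the failure buffer, and the second succeeds after the attacker switches strategy. Passing the result to `build_refusal_dataset` yields the (question, refusal) pairs that the caption describes as input to the Defender's refusal training; the fine-tuning step itself is outside this sketch.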