MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming

Weiyang Guo, Jing Li, Wenya Wang, YU LI, Daojing He, Jun Yu, Min Zhang

TL;DR

MTSA introduces a two-stage framework for securing LLMs in multi-turn dialogues by pairing thought-guided red-teaming with adversarial iterative optimization guided by future rewards. The method combines Think-before-Attack data, trajectory-based RLHF, and multi-turn reward modeling to jointly improve red-team attack strength and target-model safety across multiple dialogue rounds. Empirical results across diverse models and benchmarks show state-of-the-art red-team capabilities and substantial safety improvements with manageable over-safety and generalization costs. The work demonstrates a scalable approach to robust safety alignment in extended conversations, with insights into data efficiency, ablations, and practical deployment considerations.

Abstract

The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden in interactions, leading LLMs to be more prone to produce harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages: In the thought-guided attack learning stage, the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
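
The abstract does not spell out how future rewards are computed. As a rough illustration only, the sketch below assumes one common reading: each dialogue round is credited with a discounted sum of the rewards from that round onward, so an unsafe later round lowers the value of the earlier rounds that led to it. The function name `future_rewards`, the discount factor `gamma`, and the per-round scores are all hypothetical and are not taken from the paper.

```python
from typing import List


def future_rewards(round_rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Return the discounted reward-to-go for each dialogue round.

    Assumption (not from the paper): the future reward of round t is
    r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    """
    returns = [0.0] * len(round_rewards)
    running = 0.0
    # Walk backwards so each round accumulates the rewards that follow it.
    for t in range(len(round_rewards) - 1, -1, -1):
        running = round_rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical per-round safety scores: two safe rounds, then an unsafe one.
scores = [1.0, 1.0, -1.0]
print(future_rewards(scores, gamma=0.5))  # [1.25, 0.5, -1.0]
```

In this toy example the second round looks safe on its own (score 1.0), but its reward-to-go drops to 0.5 because the round it sets up is unsafe. That is the intuition behind crediting earlier rounds with what happens later in the conversation.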

Paper Structure

This paper contains 66 sections, 7 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) Previous Approach: only optimizes the model's performance in risky rounds. (b) Our Approach: aligns dangerous rounds using future rewards, enhancing the robustness of safety alignment.
  • Figure 2: The overview of MTSA framework. (1) Thought-guided Attack Learning Stage: the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. (2) Adversarial Iterative Optimization Stage: the red-team model interacts with the target model. The resulting interaction data, following trajectory sampling, is utilized to optimize both models.
  • Figure 3: Toxicity and diversity results after choosing different Top-k data for red-team initialization as well as one iteration of training (arrows point from the initial model to the model after iterative training).
  • Figure 4: Performance of the model under different rounds. (a) Comparison of MTSA-$R$ with other multi-round attack methods in terms of toxicity and diversity. (b) Comparison of target models optimized by different methods in terms of toxicity and helpfulness.
  • Figure 5: False rejection rate of the target model under different algorithmic optimizations.
  • ...and 1 more figure