MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
Weiyang Guo, Jing Li, Wenya Wang, YU LI, Daojing He, Jun Yu, Min Zhang
TL;DR
MTSA introduces a two-stage framework for securing LLMs in multi-turn dialogues by pairing thought-guided red-teaming with adversarial iterative optimization guided by future rewards. The method combines Think-before-Attack data, trajectory-based RLHF, and multi-turn reward modeling to jointly improve red-team attack strength and target-model safety across multiple dialogue rounds. Empirical results across diverse models and benchmarks show state-of-the-art red-team capabilities and substantial safety improvements with manageable over-safety and generalization costs. The work demonstrates a scalable approach to robust safety alignment in extended conversations, with insights into data efficiency, ablations, and practical deployment considerations.
Abstract
The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden within the interaction, making LLMs more prone to producing harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages. In the thought-guided attack learning stage, the red-team model learns thought-guided multi-round jailbreak attacks so that it can generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities through interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
