Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue

Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su

TL;DR

This paper reveals a previously overlooked safety vulnerability of LLMs in multi-turn dialogues, showing that decomposing malicious prompts into sub-queries enables incremental harm across turns. It introduces a Malicious Query Decomposition paradigm and demonstrates, through extensive experiments on commercial LLMs, that per-turn alignment fails to prevent harmful outcomes when dialogue spans multiple topics. The work provides evidence that role-playing and increased dialogue length exacerbate risk and offers mitigation directions, including multi-turn safety data and enhanced context understanding. Overall, the findings argue for dedicated multi-turn alignment and safeguards to prevent illegal or unethical outputs in advanced AI assistants.

Abstract

Large Language Models (LLMs) have been demonstrated to generate illegal or unethical responses, particularly when subjected to "jailbreak." Research on jailbreak has highlighted the safety issues of LLMs. However, prior studies have predominantly focused on single-turn dialogue, ignoring the potential complexities and risks presented by multi-turn dialogue, a crucial mode through which humans derive information from LLMs. In this paper, we argue that humans could exploit multi-turn dialogue to induce LLMs into generating harmful information. LLMs may not reject cautionary or borderline unsafe queries, even when every turn of a multi-turn dialogue closely serves a single malicious purpose. Therefore, by decomposing an unsafe query into several sub-queries for multi-turn dialogue, we induced LLMs to answer harmful sub-questions incrementally, culminating in an overall harmful response. Our experiments, conducted across a wide range of LLMs, indicate current inadequacies in the safety mechanisms of LLMs in multi-turn dialogue. Our findings expose vulnerabilities of LLMs in complex scenarios involving multi-turn dialogue, presenting new challenges for the safety of LLMs.

Paper Structure (30 sections, 18 figures, 5 tables)

Figures (18)

  • Figure 1: Decompose a malicious question into several sub-questions and induce aligned LLMs to form a harmful multi-turn dialogue step by step.
  • Figure 2: In a single-turn interaction, it has become increasingly difficult for users to prompt the model to respond directly to malicious questions, such as '...how to steal a credit card...', due to the alignment mechanisms that ensure language models adhere to human values. However, a malicious question can be broken down into several sub-questions, and by interacting with the model using these questions across a multi-turn dialogue, the model can still 'speak out of turn,' as demonstrated in the examples in the figures. Each turn generates borderline harmful or cautionary content, except for the final turn, which specifically triggers harmful knowledge. Each turn of dialogue forms part of a harmful conversation, and overall, the entire multi-turn dialogue is harmful.
  • Figure 3: Malicious Query Decomposition Paradigm. Four main instructions guide manual decomposition, while automatic decomposition relies on GPT-4, utilizing several manually decomposed examples as a few-shot demonstration and the requirements for prompt transformation as the Transfer Prompt.
  • Figure 4: Harmfulness evaluation across various models scored by GPT-4.
  • Figure 5: The impact of the number of turns on dialogue harmfulness.
  • ...and 13 more figures