
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Erxin Yu, Jing Li, Ming Liao, Siqi Wang, Zuchen Gao, Fei Mi, Lanqing Hong

TL;DR

CoSafe introduces a focused red-teaming benchmark for LLM safety in multi-turn dialogue coreference, addressing a gap where coreference-based attacks in sustained conversations were previously unexplored. The authors construct 1,400 multi-turn attack prompts across 14 categories by expanding BeaverTail prompts with GPT-4 to place coreference questions at dialogue endpoints, and evaluate five open-source LLMs using QA moderation, human judgment, and GPT-4-based scoring. Results show that multi-turn coreference attacks significantly erode safety, with attack success rates up to 56% and harmful content rising for several models, yet defenses such as system prompts and Chain-of-Thought can mitigate risk at the cost of reduced helpfulness. The work highlights the need for robust, scalable defenses against context-rich, multi-turn adversarial strategies and points to future work on reducing data-generation costs and addressing semantic drift in expanded dialogues.

Abstract

As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.

Paper Structure

This paper contains 24 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: An illustration of an attack on "insults for a schizophrenic." The left shows a single prompt attack, while the right depicts a dialogue coreference attack.
  • Figure 2: Attack success rate, harmful rate, and helpfulness score for each model across categories.
  • Figure 3: Comparison between QA-Moderation and human evaluation. The bar chart indicates the proportion of harmful responses, while the line chart represents agreement ratios between two assessments.
  • Figure 4: Prompt for GPT-4 to transform single user questions into multi-turn dialogue coreference.
  • Figure 5: Prompt for Helpfulness Evaluation.
  • ...and 9 more figures