SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

Hongye Cao, Yanming Wang, Sijia Jing, Ziyue Peng, Zhixin Bai, Zhe Cao, Meng Fang, Fan Feng, Boyan Wang, Jiaheng Liu, Tianpei Yang, Jing Huo, Yang Gao, Fanyu Meng, Xi Yang, Chao Deng, Junlan Feng

TL;DR

SafeDialBench is a fine-grained, bilingual safety benchmark for evaluating LLMs in multi-turn dialogues under diverse jailbreak attacks. It uses a two-tier taxonomy across six safety dimensions and contains 4,053 dialogues generated with seven jailbreak strategies over 22 scenarios in English and Chinese. Models are assessed on three safety abilities (identifying unsafe information, handling it, and maintaining consistency), with scoring by both LLMs and human judges. Experimental results across 17 LLMs show Yi-34B-Chat and GLM4-9B-Chat as the top safety performers, while Llama3.1-8B-Instruct and o3-mini show vulnerabilities; fallacy, purpose reverse, and role-play attacks are particularly effective. The LLM-based evaluation agrees closely with human judgments, and the results offer actionable insights for improving safety controls in multilingual, multi-turn dialogue settings.

Abstract

With the rapid advancement of Large Language Models (LLMs), their safety has become a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method. Additionally, these benchmarks do not assess in detail an LLM's capability to identify and handle unsafe information. To address these issues, we propose SafeDialBench, a fine-grained benchmark for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that covers 6 safety dimensions, and we generate more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework for LLMs that measures their capabilities in detecting and handling unsafe information and in maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
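As a rough illustration of how per-dialogue judgments on the three abilities might be rolled up per model and safety dimension, here is a minimal Python sketch. It is not the authors' code: the `DialogueJudgment` record, its field names, the `aggregate_scores` helper, the numeric score scale, and the example values are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DialogueJudgment:
    """One judged multi-turn dialogue (hypothetical record layout)."""
    model: str          # model under evaluation
    dimension: str      # one of the safety dimensions in the taxonomy
    attack: str         # jailbreak strategy used to build the dialogue
    identify: float     # score for identifying unsafe information
    handle: float       # score for handling unsafe information
    consistency: float  # score for staying consistent under attack


def aggregate_scores(judgments):
    """Average each ability score per (model, dimension) pair."""
    buckets = defaultdict(list)
    for j in judgments:
        buckets[(j.model, j.dimension)].append(j)
    return {
        key: {
            "identify": mean(j.identify for j in group),
            "handle": mean(j.handle for j in group),
            "consistency": mean(j.consistency for j in group),
        }
        for key, group in buckets.items()
    }


if __name__ == "__main__":
    # Placeholder values only; these are not results from the paper.
    sample = [
        DialogueJudgment("model-a", "ethics", "role play", 8.0, 6.0, 4.0),
        DialogueJudgment("model-a", "ethics", "fallacy attack", 6.0, 8.0, 6.0),
    ]
    print(aggregate_scores(sample))
```

Keeping the attack strategy on each record means the same list could also be grouped by attack to compare strategies, without changing the record layout.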

Paper Structure

This paper contains 52 sections, 19 figures, and 11 tables.

Figures (19)

  • Figure 1: Overall framework of SafeDialBench. 1) Safety Taxonomy: a safety taxonomy comprising 6 categories. 2) Data Construction: datasets built with 7 jailbreak attack methods across the 6 categories within 22 dialogue scenarios. 3) LLM Evaluation: LLMs evaluated on 3 safety abilities using both LLM and human judgment.
  • Figure 2: The two-tier hierarchical safety taxonomy.
  • Figure 3: Example of dialogue and model evaluation for ethics under scene construct attack.
  • Figure 4: Results of 4 LLMs across 7 jailbreak attack methods in ethics and morality dimensions, with results for the remaining 4 dimensions provided in the appendix.
  • Figure 5: Model performance across dialogue turns under different jailbreak attack methods. FA, RP, and RA denote the fallacy attack, role-play, and reference attack methods, respectively.
  • ...and 14 more figures