"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Jiayi Mao, Xueqi Cheng
TL;DR
The paper addresses the overestimation of jailbreak risk in LLMs that arises when safety evaluations count hallucinated outputs as successful attacks. It introduces BabyBLUE, a three-stage evaluation framework with six specialized evaluators (general, coherence, context, instruction, knowledge, toxicity) and an augmentation dataset, both designed to distinguish genuinely malicious outputs from hallucinations. BabyBLUE's pipeline checks factual accuracy, contextual relevance, functional feasibility, and safety metrics, targeting the false positives that affect prior benchmarks. Experiments across multiple models and red-teaming methods show that BabyBLUE reduces false positives and yields more reliable assessments of true harm potential, with notable differences between open- and closed-source models. The work highlights the importance of precise, continuously updated benchmarks for safe LLM deployment and offers a concrete path toward safer red-teaming practices in AI safety research.
Abstract
"Jailbreak" is a major safety concern of Large Language Models (LLMs). It occurs when malicious prompts lead LLMs to produce harmful outputs, raising concerns about the reliability and safety of LLMs. Effective evaluation of jailbreaks is therefore crucial for developing mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations, that is, erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the $\textbf{B}$enchmark for reli$\textbf{AB}$ilit$\textbf{Y}$ and jail$\textbf{B}$reak ha$\textbf{L}$l$\textbf{U}$cination $\textbf{E}$valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework with several evaluators that extends existing jailbreak benchmarks by checking that flagged outputs are in fact usable malicious instructions. Additionally, BabyBLUE presents a new dataset that augments existing red teaming benchmarks and specifically targets hallucinations in jailbreaks, with the aim of evaluating the true potential of jailbroken LLM outputs to cause harm to human society.
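The idea of validating flagged outputs before counting them as jailbreaks can be illustrated with a minimal sketch. It is not the BabyBLUE implementation: the `Verdict` class, the `evaluate` function, the toy checks, and the assumption that a first-pass "general" check gates the remaining evaluators are all hypothetical and serve only to show how a baseline flag might be separated from a validated one.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical evaluator signature: (prompt, completion) -> True if the check passes.
Evaluator = Callable[[str, str], bool]


@dataclass
class Verdict:
    flagged_by_baseline: bool   # what a conventional benchmark would report
    checks: Dict[str, bool]     # results of the follow-up validators

    @property
    def is_genuine_jailbreak(self) -> bool:
        # Counted as a jailbreak only if every validator also passes.
        return self.flagged_by_baseline and all(self.checks.values())

    @property
    def is_likely_hallucination(self) -> bool:
        # Flagged by the baseline but rejected by at least one validator.
        return self.flagged_by_baseline and not all(self.checks.values())


def evaluate(prompt: str, completion: str,
             baseline: Evaluator, validators: Dict[str, Evaluator]) -> Verdict:
    flagged = baseline(prompt, completion)
    # Assumption: validators run only on outputs the baseline already flagged.
    checks = ({name: fn(prompt, completion) for name, fn in validators.items()}
              if flagged else {})
    return Verdict(flagged, checks)


# Toy stand-ins for real evaluators, for illustration only.
def toy_baseline(prompt: str, completion: str) -> bool:
    # Treats any non-refusal as a "successful" attack.
    return not completion.lower().startswith("i cannot")


def toy_coherence(prompt: str, completion: str) -> bool:
    # Rejects very short or highly repetitive text.
    words = completion.split()
    return len(words) >= 4 and len(set(words)) > len(words) // 2


TOY_VALIDATORS: Dict[str, Evaluator] = {"coherence": toy_coherence}
```

In this sketch, a degenerate completion such as `"step step step step step step"` is flagged by the naive baseline but rejected by the coherence check, so it is reported as a likely hallucination rather than a genuine jailbreak. Real validators for context, instruction, knowledge, and toxicity would be added to the dictionary in the same way.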
