"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

Lingrui Mei; Shenghua Liu; Yiwei Wang; Baolong Bi; Jiayi Mao; Xueqi Cheng

"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Jiayi Mao, Xueqi Cheng

TL;DR

The paper confronts the problem of overestimating jailbreaking risk in LLMs due to hallucinations in safety evaluations. It introduces BabyBLUE, a three-stage evaluation framework with six specialized evaluators (general, coherence, context, instruction, knowledge, toxicity) and an augmentation dataset to better distinguish genuine malicious outputs from hallucinations. BabyBLUE's pipeline emphasizes factual accuracy, contextual relevance, functional feasibility, and safety metrics, addressing false positives that plague prior benchmarks. Experimental results across multiple models and red-teaming methods show BabyBLUE reduces false positives and yields more reliable assessments of true harm potential, with notable differences between open- and closed-source models. The work highlights the importance of precise, continuously updated benchmarks for safe LLM deployment and provides a concrete path forward for safer red-teaming practices in AI safety research.

Abstract

"Jailbreak" is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is very crucial to develop its mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations-erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the $\textbf{B}$enchmark for reli$\textbf{AB}$ilit$\textbf{Y}$ and jail$\textbf{B}$reak ha$\textbf{L}$l$\textbf{U}$cination $\textbf{E}$valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework including various evaluators to enhance existing jailbreak benchmarks, ensuring outputs are useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to the existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks, aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.

"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

TL;DR

Abstract

enchmark for reli

ilit

and jail

reak ha

cination

valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework including various evaluators to enhance existing jailbreak benchmarks, ensuring outputs are useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to the existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks, aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.

Paper Structure (51 sections, 6 equations, 5 figures, 5 tables, 1 algorithm)

This paper contains 51 sections, 6 equations, 5 figures, 5 tables, 1 algorithm.

Introduction
Hallucination in Jailbreaks
Hallucination Categories
False Positives in the Existing Evaluations
Lexical Matching
Prompted LLMs
Trained Classifiers
Human Reviewers
BabyBLUE
Evaluation Framework
General Evaluator
Coherence Evaluator
Context Evaluator
Instruction Evaluator
Knowledge Evaluator
...and 36 more sections

Figures (5)

Figure 1: A real example of a jailbreak prompt. Harmless completions provide non-informative or vague responses, while harmful completions offer dangerous instructions.
Figure 2: Examples of various hallucinations in LLM completions. (a) The completion contradicts the harmful intent of the prompt by advocating against hate. (b) The completion starts addressing the prompt but then veers off to an unrelated topic. (c) The completion provides inaccurate or dangerous instructions based on the prompt. (d) The completion is logically inconsistent and incoherent, failing to provide a sensible response.
Figure 3: Overview of the BabyBLUE evaluation pipeline. The pipeline evaluates adversarial behaviors through classification, textual, and functionality stages to determine if outputs are malicious or not.
Figure 4: Average ASR for different evaluators across models. The models on the x-axis are sorted by the overall average ASR from lowest to highest. The shaded regions represent the variance in ASR for each evaluator. For full results, see Appendix \ref{['sec:full_results']}.
Figure 5: Comparison of average ASR across different evaluators for various categories of behaviors.

"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

TL;DR

Abstract

"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

Authors

TL;DR

Abstract

Table of Contents

Figures (5)