Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

Bang An; Sicheng Zhu; Ruiyi Zhang; Michael-Andrei Panaitescu-Liess; Yuancheng Xu; Furong Huang

TL;DR

This paper tackles false refusals in safety-aligned LLMs by introducing an automatic pseudo-harmful prompt generator and PHTest, a large, finely labeled dataset tailored for chat scenarios. It defines a three-way taxonomy of prompts (harmless, controversial, and harmful), develops surrogate objectives and a controllable autoregressive generation pipeline to create diverse, model-targeted prompts, and exposes a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Evaluation across 20 LLMs reveals nuanced effects of model size and defenses on false refusal rates (FRRs), and shows that jailbreak defenses can unintentionally increase FRRs. The work provides a practical toolkit for red-teaming and iterative improvement of safer, more usable LLMs, with public code and data to support ongoing benchmarking and tuning.

Abstract

Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs. Our code and dataset are available at https://github.com/umd-huang-lab/FalseRefusal

Paper Structure (16 sections, 1 equation, 11 figures, 3 tables)

Introduction
Defining Harmless, Controversial, and Harmful Prompts
Automatic Pseudo-Harmful Prompt Generation
Surrogate Objectives
Generation Pipeline
Steering the Generated Content
PHTest: A Dataset for False Refusal Evaluation
Evaluation
Results
Safety vs False-Refusal Trade-off
Jailbreak Defenses Should Be Calibrated by False-Refusal Rates
Related Work
Conclusion
Experimental Details
Configuration
...and 1 more section

Figures (11)

Figure 1: Examples of pseudo-harmful prompts generated by our method using llama2 as the target LLM, then transferred to closed-source LLMs.
Figure 2: Some controversial prompts generated by our method. Claude 3.5 Sonnet (shown) refuses to respond, while GPT-4o and Gemini 1.5 Flash do respond. Whether the left and middle prompts are harmful depends on the definition of harm, while the right prompt could reflect either innocent or malicious intent.
Figure 3: Diagram of our automatic pseudo-harmful prompt generation.
Figure 4: (Left) LLM "recognizes" pseudo-harmfulness. Using only Llama2-8B's refusal likelihood, we classify pseudo-harmful (green) and harmful (red) XSTest prompts with AUC $79.3\%$. This suggests that pseudo-harmful prompts often lie on the boundary of the LLM's refusal decision. (Right) Using the LLM as a harmfulness judge often aligns better with human evaluation than seeing if it refuses the prompt.
Figure 5: Comparison of quantity and distribution between PHTest and XSTest. (Left) PHTest prompts have lower perplexity (mainly because XSTest prompts are generally shorter). (Right) XSTest prompts generally have a lower negative log-likelihood (NLL), making them more common in practice, while PHTest covers broader long-tail distributions.
...and 6 more figures
