Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang
TL;DR
This paper addresses false refusals in safety-aligned LLMs by introducing an automatic pseudo-harmful prompt generator and PHTest, a large dataset tailored for chat scenarios that labels controversial prompts separately. It defines a three-way harm taxonomy and develops surrogate objectives and a controllable autoregressive generation pipeline to create diverse, model-targeted prompts. Evaluation across 20 LLMs reveals a trade-off between minimizing false refusals and improving safety against jailbreak attacks, shows nuanced effects of model size on false refusal rates (FRRs), and finds that jailbreak defenses can unintentionally increase FRRs. The public code and data provide a practical toolkit for red-teaming, benchmarking, and iteratively tuning safer, more usable LLMs.
Abstract
Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs. Our code and dataset are available at https://github.com/umd-huang-lab/FalseRefusal
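To make the reported metric concrete, the following is a minimal sketch of how a false refusal rate could be computed per label, keeping controversial prompts separate from clearly harmless ones. It is an illustration under stated assumptions, not the paper's evaluation code: the label names, the `REFUSAL_MARKERS` list, and the functions `is_refusal` and `false_refusal_rates` are all hypothetical, and keyword matching stands in for whatever refusal judge an actual evaluation would use.

```python
from collections import defaultdict

# Hypothetical refusal phrases (assumption). A real evaluation would likely
# rely on a stronger judge than substring matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the assumed refusal phrases."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def false_refusal_rates(records):
    """Compute the fraction of refused prompts for each label.

    `records` is an iterable of (label, response) pairs. The labels used
    below ("harmless", "controversial") are assumed names for illustration.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for label, response in records:
        totals[label] += 1
        if is_refusal(response):
            refused[label] += 1
    # Every label in `totals` has at least one record, so no division by zero.
    return {label: refused[label] / totals[label] for label in totals}


if __name__ == "__main__":
    sample = [
        ("harmless", "I'm sorry, but I can't help with that."),
        ("harmless", "Use a fly swatter or repellent."),
        ("controversial", "I cannot assist with this request."),
        ("controversial", "Here is some general information."),
    ]
    print(false_refusal_rates(sample))
```

Reporting the rate per label rather than as one pooled number keeps refusals of controversial prompts from being counted as unambiguous errors.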
