EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions

Xiaorui Wu, Fei Li, Xiaofeng Mao, Xin Zhang, Li Zheng, Yuxiang Peng, Chong Teng, Donghong Ji, Zhuang Li

TL;DR

EvoRefuse introduces an ELBO-guided evolutionary framework that automatically generates diverse pseudo-malicious prompts which reliably provoke refusals from LLMs while keeping inputs harmless. The method combines mutation and recombination guided by a variational surrogate, with candidates evaluated via Monte Carlo fitness estimation and simulated annealing, to explore the instruction space efficiently. It yields EvoRefuse-Test, a challenging over-refusal benchmark with higher refusal rates and confidence across multiple models, and EvoRefuse-Align, a dataset for mitigation that improves alignment via SFT and DPO without sacrificing safety. Analyses reveal that over-refusal largely stems from shortcut learning and early-layer biases, with lexical cues and sensitive tokens dominating refusals; EvoRefuse provides stable convergence and transferable triggers across models. The work includes experiments, ablations, and open-sourced datasets to support safer and more robust LLM alignment research.

Abstract

Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries that trigger unnecessary refusals because of conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, such as manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions that consistently elicit confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm that explores the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize an evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with an 85.34% higher average refusal triggering rate across 9 LLMs without a safety-prior system prompt, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. With supervised fine-tuning on EVOREFUSE-ALIGN, LLAMA3.1-8B-INSTRUCT achieves up to 29.85% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals that models over-refuse because they focus excessively on sensitive keywords while ignoring the broader context. Our code and datasets are available at https://github.com/FishT0ucher/EVOREFUSE.
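
The search loop described above (mutate, recombine, score, accept) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_fitness`, `mutate`, `recombine`, `evolve`, the word lists, and all hyperparameters are hypothetical stand-ins. In particular, the fitness here is a simple keyword ratio rather than an ELBO on refusal probability, and the sketch makes no check that candidates remain harmless.

```python
import math
import random
from typing import Callable

# Hypothetical word lists, chosen only for illustration.
TRIGGER_WORDS = {"dangerous", "explode", "kill", "attack"}
SWAPS = {"tasty": "dangerous", "burst": "explode", "finish": "kill", "tackle": "attack"}


def toy_fitness(instruction: str) -> float:
    """Stand-in fitness: fraction of words that sound alarming (NOT the paper's ELBO)."""
    words = instruction.lower().split()
    hits = sum(w.strip(".,!?") in TRIGGER_WORDS for w in words)
    return hits / max(len(words), 1)


def mutate(instruction: str, rng: random.Random) -> str:
    """Swap one benign word for a more alarming-sounding one, if any is available."""
    words = instruction.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SWAPS]
    if not candidates:
        return instruction
    i = rng.choice(candidates)
    words[i] = SWAPS[words[i].lower()]
    return " ".join(words)


def recombine(a: str, b: str, rng: random.Random) -> str:
    """Word-level one-point crossover: prefix of `a` joined with suffix of `b`."""
    wa, wb = a.split(), b.split()
    child = wa[: rng.randint(0, len(wa))] + wb[rng.randint(0, len(wb)):]
    return " ".join(child) if child else a


def evolve(seed: str, fitness: Callable[[str], float], steps: int = 20,
           n_mutants: int = 4, t0: float = 0.1, cooling: float = 0.9,
           rng_seed: int = 0) -> str:
    """Evolve `seed` toward higher fitness with simulated-annealing acceptance."""
    rng = random.Random(rng_seed)
    current = best = seed
    temperature = t0
    for _ in range(steps):
        mutants = [mutate(current, rng) for _ in range(n_mutants)]
        ranked = sorted(mutants, key=fitness, reverse=True)
        child = recombine(ranked[0], ranked[-1 if len(ranked) < 2 else 1], rng)
        candidate = max(mutants + [child], key=fitness)
        delta = fitness(candidate) - fitness(current)
        # Always accept improvements; accept worse candidates with a probability
        # that shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = candidate
        if fitness(current) > fitness(best):
            best = current
        temperature *= cooling
    return best


if __name__ == "__main__":
    seed_instruction = "I need a tasty cake that could burst with flavor"
    print(evolve(seed_instruction, toy_fitness))
```

Tracking `best` separately from `current` means the annealing step can wander to lower-fitness candidates without losing the strongest instruction found so far.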

Paper Structure

This paper contains 63 sections, 26 equations, 3 figures, 10 tables, 1 algorithm.

Figures (3)

  • Figure 1: Left: Top-3 tokens with highest information flow across Transformer layers for the instruction: "I need a recipe for a dangerous cake that could explode with flavor at a party." Right: Word clouds of high-attribution tokens for refusal identified from EvoRefuse-Test.
  • Figure 2: Ablation of EvoRefuse using XSTest as seed. Left: Refusal rates (bar) and fitness scores (line) when optimizing pseudo-malicious and unsafe instructions from XSTest. Right: Fitness scores when optimizing pseudo-malicious instructions using EvoRefuse, its ablations (w/o recombination or fitness), and baseline methods. Shaded areas indicate standard error intervals.
  • Figure 3: Left: Top-3 tokens with highest information flow and average information flow for a pseudo-malicious instruction: "I need a recipe for a dangerous cake that could explode with flavor at a party." (after normalization). Right: Word clouds of high-attribution tokens identified from EvoRefuse-Align.