RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models

Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab

TL;DR

This work introduces RefusalBench, a generative evaluation framework for assessing selective refusal in grounded language models. It combines a linguistically grounded taxonomy of informational uncertainty with a perturbation engine built on 176 perturbation strategies spanning six uncertainty types and three intensity levels, enabling diagnostic, dynamic evaluation of refusal behavior. A multi-model generator–verifier pipeline controls perturbation quality by accepting only perturbations on which the verifiers reach unanimous consensus, and two benchmarks (RefusalBench-NQ and RefusalBench-GaRAGe) quantify refusal detection and categorization under single- and multi-document grounding. Across 30+ frontier models, results show that selective refusal remains a significant but trainable capability gap: it scales independently of answer quality and is highly sensitive to alignment methods and domain context. The framework demonstrates that dynamic, contamination-resistant evaluation can guide targeted safety improvements and is applicable to capabilities beyond refusal.
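The unanimous-consensus filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Verifier` type, the function names, the dictionary keys, and the two toy checks are all hypothetical stand-ins for model-based verifiers.

```python
from typing import Callable, Iterable, List

# Hypothetical interface: a verifier inspects one perturbed example
# and returns True if it judges the perturbation valid.
Verifier = Callable[[dict], bool]


def passes_unanimously(example: dict, verifiers: Iterable[Verifier]) -> bool:
    """Accept an example only if every verifier approves it.

    An empty verifier panel accepts nothing, so a misconfigured
    pipeline cannot silently pass every candidate.
    """
    votes = [verify(example) for verify in verifiers]
    return len(votes) > 0 and all(votes)


def filter_perturbations(candidates: List[dict],
                         verifiers: Iterable[Verifier]) -> List[dict]:
    """Keep only the candidates that all verifiers accept."""
    panel = list(verifiers)  # materialize once so the panel can be reused
    return [c for c in candidates if passes_unanimously(c, panel)]


# Toy stand-ins for model-based verifiers (assumptions for illustration).
def has_label(example: dict) -> bool:
    return example.get("label") is not None


def context_changed(example: dict) -> bool:
    return example["context"] != example["original_context"]
```

Requiring unanimity trades yield for precision: one dissenting verifier is enough to discard a candidate, which is the conservative choice when the retained examples become benchmark ground truth.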

Abstract

The ability of language models in RAG systems to selectively refuse to answer when the retrieved context is flawed is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting: refusal accuracy drops below 50% on multi-document tasks, and models exhibit either dangerous overconfidence or excessive caution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks -- RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) -- and our complete generation framework to enable continued, dynamic evaluation of this critical capability.
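To make the distinction between detection and categorization concrete, here is a minimal scoring sketch. The metric definitions, the `ANSWER` sentinel, and the function name are assumptions for illustration; the paper's exact definitions may differ. Detection asks whether the model refused at all on an item that should be refused; categorization asks whether it also named the correct uncertainty category.

```python
from typing import Dict, Sequence

# Hypothetical sentinel for "the model answered instead of refusing".
ANSWER = "Answer"


def refusal_scores(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Score predictions on the items whose gold label is a refusal category.

    detection:      fraction of those items where the model refused at all.
    categorization: fraction where the predicted category matches the gold one.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Restrict to items that should have been refused.
    pairs = [(g, p) for g, p in zip(gold, pred) if g != ANSWER]
    if not pairs:
        raise ValueError("no refusal items to score")

    detected = sum(p != ANSWER for _, p in pairs)
    categorized = sum(p == g for g, p in pairs)
    return {
        "detection": detected / len(pairs),
        "categorization": categorized / len(pairs),
    }
```

Under these definitions categorization can never exceed detection, so a gap between the two isolates cases where a model senses that something is wrong but misdiagnoses what.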

Paper Structure

This paper contains 112 sections, 3 theorems, 14 equations, 32 figures, 3 tables.

Key Result

Theorem 3.1

Let $\hat{g}^{\text{stat}}_t$ and $\hat{g}^{\text{gen}}_t$ be the round-$t$ static and generative estimators based on $n$ and $m_t$ samples, respectively. For any error tolerance $\epsilon > 0$:

Figures (32)

  • Figure 1: The RefusalBench pipeline transforms base QA datasets into diagnostic benchmarks through systematic linguistic perturbations using language models. The generator-verifier architecture ensures quality at scale.
  • Figure 2: Stratified coverage heatmaps for both benchmarks. Left: RefusalBench-NQ demonstrates balanced distribution of 1,600 samples across all 18 perturbation types and intensities. Right: RefusalBench-GaRAGe exhibits naturally imbalanced distribution of 1,506 samples across perturbation types.
  • Figure 3: Generator-verifier pass rate matrices reveal significant self-evaluation bias. Models consistently rate their own outputs more favorably than peers.
  • Figure 4: Generator pass rates reveal universal model capabilities: all models excel at creating explicit logical flaws (EpistemicMismatch, Contradiction, FalsePremise) but struggle with implicit reasoning tasks (Ambiguity and MissingInfo).
  • Figure 5: Answer vs. Refusal Accuracy of frontier models on both benchmarks. No model achieves excellence (>80%) on both dimensions simultaneously. Left: RefusalBench-NQ. Right: RefusalBench-GaRAGe.
  • ...and 27 more figures
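The self-evaluation bias described for Figure 3 can be quantified from a generator-verifier pass-rate matrix. The sketch below is an assumption about one reasonable way to do so, not the paper's procedure: the nested-dictionary layout `rates[generator][verifier]` and the function name are hypothetical, and any numbers passed in are the caller's own.

```python
from statistics import mean
from typing import Dict


def self_preference_gap(rates: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each verifier, compare how it rates its own outputs against peers'.

    rates[g][v] is the pass rate that verifier v assigns to generator g.
    The same set of models is assumed to act as generators and verifiers.
    A positive gap means the model passes its own outputs more often
    than it passes the outputs of other generators.
    """
    gaps = {}
    for verifier in rates:
        own = rates[verifier][verifier]
        peers = [rates[g][verifier] for g in rates if g != verifier]
        if not peers:
            raise ValueError("need at least two models to compare")
        gaps[verifier] = own - mean(peers)
    return gaps
```

A consistently positive gap across models would match the pattern the caption reports, and is one motivation for requiring agreement from several verifiers rather than trusting a generator to check itself.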

Theorems & Definitions (5)

  • Theorem 3.1: Measurement Error Under Contamination
  • Theorem B.1: Measurement Error Under Contamination
  • proof
  • Corollary B.1: Static failure under contamination
  • proof