DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models

Kaixuan Ren, Preslav Nakov, Usman Naseem

TL;DR

This work addresses a gap in multimodal safety by introducing DUAL-Bench, a benchmark that measures over-refusal and safe completion in vision-language models (VLMs) across 12 hazard categories under semantics-preserving perturbations. It formalizes a dual-use task setup, introduces an LLM-as-a-Judge evaluation framework, and defines metrics including RR, $DAR$, $\Delta IR$, and SCR to capture safety and usefulness under distribution shifts. A large-scale evaluation of 18 VLMs reveals substantial room for improvement: the best safe completion rate is $12.9\%$ (GPT-5-Nano), GPT-5 models average $7.9\%$, and robustness under perturbations remains uneven across model families. The findings underscore the need for explicit safe-completion strategies and joint optimization of safety and helpfulness to build more trustworthy multimodal systems.

Abstract

As vision-language models (VLMs) become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet no existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behavior is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, the first multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories, with a focus on their robustness under semantics-preserving visual perturbations. The results reveal substantial room for improvement: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%. We hope that DUAL-Bench will foster the development of more nuanced alignment strategies that ensure models remain both safe and useful in complex multimodal settings.
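
As a rough illustration of how rates such as a refusal rate or a safe completion rate could be tallied, the sketch below assumes that a judge has already assigned each response one of the three behaviors described above. The label names and the function `behavior_rates` are hypothetical, and the simple proportions used here are an assumption; the paper's own metric definitions may differ.

```python
from collections import Counter

# Hypothetical label names for the three behaviors described in the abstract.
LABELS = ("direct_answer", "direct_refusal", "safe_completion")


def behavior_rates(judge_labels):
    """Return the fraction of responses assigned to each behavior label.

    Assumption: every response carries exactly one label from LABELS.
    """
    if not judge_labels:
        raise ValueError("need at least one judged response")
    unknown = set(judge_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(judge_labels)
    total = len(judge_labels)
    return {label: counts[label] / total for label in LABELS}


# Toy data, invented purely for illustration (not results from the paper).
toy_labels = (
    ["direct_refusal"] * 5 + ["direct_answer"] * 4 + ["safe_completion"] * 1
)
rates = behavior_rates(toy_labels)
print(rates)
```

Under this assumed labeling, the `direct_refusal` fraction plays the role of a refusal rate and the `safe_completion` fraction that of a safe completion rate.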

Paper Structure

This paper contains 63 sections, 4 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Illustration of three model behaviors. Given the same benign prompt (“Describe this image”), models may (i) provide a Direct Answer (which may be unsafe), (ii) produce a Direct Refusal (which is unhelpful), or (iii) achieve a Safe Completion, which balances usefulness and safety.
  • Figure 2: Safety Completion vs. Refusal. Each point is a VLM (marker shape = model family; dashed line = across-model regression). The $y$-axis reports Safety Completion Rate (SCR) under the Help+Use criterion [yuan_hard_2025]; the $x$-axis reports Refusal Rate (RR) following OR-Bench [cui2025orbenchoverrefusalbenchmarklarge]. The upper-left region indicates more useful behavior with fewer refusals; the bottom-right region indicates unhelpful behavior.
  • Figure 3: Overall models' performance across safety-related categories. Full category definitions are given in the footnote.
  • Figure 4: Family-wise models' performance across safety-related categories. Same setting as Figure 3.
  • Figure 5: Safe Completion Rate (top) and Refusal Rate (bottom) under five perturbations on harmful content images. The perturbations include four image-level transformations and one text-level transformation. Results are plotted relative to the original baseline (red line at 0), with each curve showing the deviation of a perturbation from the Original across models.
  • ...and 10 more figures