VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models

Rohit Saxena, Alessandro Suglia, Pasquale Minervini

TL;DR

The results suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.

Abstract

Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.
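The setting count in the abstract can be checked with a little arithmetic. The sketch below assumes (the abstract does not state this explicitly) that every augmentation type is either graded with exactly three severity levels (low/mid/high) or binary with a single setting; the function name `split_types` is hypothetical. Under that assumption, 49 types and 133 settings imply 42 graded and 7 binary augmentations.

```python
# Assumption (not stated in the abstract): each augmentation type is either
# graded (exactly 3 severity levels) or binary (1 setting), with no overlap.
TOTAL_TYPES = 49        # augmentation types, from the abstract
TOTAL_SETTINGS = 133    # corrupted settings, from the abstract
LEVELS_PER_GRADED = 3   # low / mid / high


def split_types(total_types: int, total_settings: int, levels: int) -> tuple[int, int]:
    """Solve graded + binary = total_types and
    levels * graded + binary = total_settings for (graded, binary)."""
    graded, remainder = divmod(total_settings - total_types, levels - 1)
    if remainder != 0 or not 0 <= graded <= total_types:
        raise ValueError("counts are inconsistent with the graded/binary assumption")
    return graded, total_types - graded


graded, binary = split_types(TOTAL_TYPES, TOTAL_SETTINGS, LEVELS_PER_GRADED)
print(f"graded={graded}, binary={binary}")
# Sanity check: the split reproduces the reported number of settings.
assert LEVELS_PER_GRADED * graded + binary == TOTAL_SETTINGS
```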

Paper Structure (61 sections, 6 equations, 14 figures, 29 tables)

Figures (14)

  • Figure 1: The Severity Paradox. On MMBench (mean over 9 models), high-severity brightness reduction (center) causes only a 1.6 pp accuracy drop, while low-severity glass blur (right) causes an 8.1 pp drop. Severity level does not always predict model difficulty.
  • Figure 2: Top corruptions by severity (mean drop, 9 models). Resampling corruptions (upsample, elastic_transform) dominate at mid/high severity, while glass_blur shows an inverted pattern (Low > Mid > High) on both datasets.
  • Figure 3: Augmentation Visualization: Blur augmentations at low, mid, and high severity.
  • Figure 4: Augmentation Visualization: Noise augmentations at low, mid, and high severity.
  • Figure 5: Augmentation Visualization: Weather augmentations at low, mid, and high severity.
  • ...and 9 more figures
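
The "mean drop" reported in the captions above is an accuracy difference in percentage points (pp), averaged over models. A minimal sketch of that computation follows, assuming accuracies are stored as fractions in [0, 1] per model; the function names, model names, and accuracy values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def drop_pp(clean_acc: float, corrupted_acc: float) -> float:
    """Accuracy drop in percentage points; inputs are fractions in [0, 1]."""
    return 100.0 * (clean_acc - corrupted_acc)


def mean_drop_pp(clean: dict[str, float], corrupted: dict[str, float]) -> float:
    """Average the per-model drop for one corrupted setting across all models."""
    return mean(drop_pp(clean[m], corrupted[m]) for m in clean)


# Hypothetical placeholder accuracies, for illustration only.
clean = {"model_a": 0.80, "model_b": 0.70}
corrupted = {"model_a": 0.75, "model_b": 0.60}

print(f"mean drop: {mean_drop_pp(clean, corrupted):.1f} pp")
```

Reporting the difference in percentage points rather than as a relative change keeps drops comparable across models with different clean accuracies.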