Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric

Xiangjie Sui, Songyang Li, Hanwei Zhu, Baoliang Chen, Yuming Fang, Xin Sun

TL;DR

The paper addresses the insufficiency of accuracy-only metrics for evaluating corruption robustness in LVLMs by introducing Bench-C, a discriminative benchmark emphasizing samples that reveal model differences under visual corruptions, and the Robustness Alignment Score (RAS), which captures fine-grained shifts in prediction uncertainty and calibration. Bench-C combines 19 corruption types across 5 severities and a discriminative sampling strategy based on cross-model prediction variability and semantic diversity, yielding a focused testbed of 849 samples. RAS integrates uncertainty and calibration dynamics through a mathematically defined score that ranges from $-2$ to $1$, enabling a structural view of robustness beyond accuracy, including destructive and corrective robustness components. Across 13 LVLMs, the study finds distinct behavioral patterns under corruption, demonstrates that higher clean accuracy does not guarantee structural robustness, and reveals a stability–adaptability tension as models either preserve or recover predictive structure under degradation. The work thus provides a principled framework for diagnosing and improving corruption robustness in LVLMs, with practical implications for reliable real-world deployment.

Abstract

Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, built with a selection strategy that jointly considers prediction inconsistency under corruption and semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruptions, such as erroneous confidence and hesitation; 2) although subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) decomposing corruption robustness into destructive and corrective components reveals distinct failure and recovery patterns across models.
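To make the idea of selecting discriminative samples more concrete, the following is a minimal sketch, not the authors' procedure. It assumes that prediction inconsistency can be approximated by how often predictions (across models and corruptions) deviate from the most common answer, and that semantic diversity can be approximated by drawing evenly from coarse categories. The names `inconsistency`, `select_discriminative`, and the `category`/`preds` fields are hypothetical.

```python
from collections import Counter, defaultdict


def inconsistency(preds):
    """Fraction of predictions that differ from the most common answer.

    `preds` is a non-empty list of answers collected across models and
    corruptions (an assumed proxy for prediction inconsistency).
    """
    counts = Counter(preds)
    return 1.0 - counts.most_common(1)[0][1] / len(preds)


def select_discriminative(samples, k):
    """Pick up to k sample ids, favoring inconsistent samples while
    rotating over categories (an assumed proxy for semantic diversity).

    `samples` is a list of dicts with keys 'id', 'category', 'preds'.
    """
    by_cat = defaultdict(list)
    for s in samples:
        by_cat[s["category"]].append(s)
    # Within each category, most inconsistent samples come first.
    for group in by_cat.values():
        group.sort(key=lambda s: inconsistency(s["preds"]), reverse=True)

    selected, rank, cats = [], 0, sorted(by_cat)
    while len(selected) < k:
        added = False
        for c in cats:  # round-robin over categories
            if rank < len(by_cat[c]) and len(selected) < k:
                selected.append(by_cat[c][rank]["id"])
                added = True
        if not added:  # every category is exhausted
            break
        rank += 1
    return selected


toy = [
    {"id": "a", "category": "animal", "preds": ["A", "A", "A", "A"]},
    {"id": "b", "category": "animal", "preds": ["A", "B", "C", "A"]},
    {"id": "c", "category": "vehicle", "preds": ["A", "B", "A", "A"]},
    {"id": "d", "category": "vehicle", "preds": ["B", "B", "B", "B"]},
]
print(select_discriminative(toy, 2))
```

In this toy run, samples on which every prediction agrees score zero and are picked last, which mirrors the motivation that low-discriminative samples mask robustness gaps between models.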

Paper Structure

This paper contains 31 sections, 16 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Motivation of this paper. We address two primary limitations (the dominance of low-discriminative samples in evaluation and the insensitivity of metrics to prediction structure degradation) by introducing a discriminative benchmark and a structure-aware Robustness Alignment Score (RAS).
  • Figure 2: The construction pipeline of Bench-C. The sample selection strategy integrates both semantic diversity and the prediction inconsistency across corruptions.
  • Figure 3: Representative prediction structure shift under corruption. Case 1: Degradation (uncertainty $\Delta \mathcal{S}\!\Uparrow,\,$ calibration $\Delta \mathcal{C}\!\Uparrow$), Case 2: Erroneous Overconfidence (uncertainty $\Delta \mathcal{S}\!\Downarrow,\,$ calibration $\Delta \mathcal{C}\!\Uparrow$), Case 3: Hesitation (uncertainty $\Delta \mathcal{S}\!\Uparrow,\,$ calibration $\Delta \mathcal{C}\!\Downarrow$), and Case 4: Stable (uncertainty $\Delta \mathcal{S}\!\Downarrow,\,$ calibration $\Delta \mathcal{C}\!\Downarrow$).
  • Figure 4: Behavioral analysis of mPLUG-Owl3 under Snow corruption ($\ell{=}5$). Each scatter plot visualizes samples in the $(\Delta \mathcal{S}, \Delta \mathcal{C})$ plane, divided into four behavioral types: (Erroneous) Overconfident, Degraded, Stable, and Hesitant. Panels (b)--(c) isolate samples that maintain or shift predictions. While $\Delta\mathrm{Acc.}$ measures correctness change, RAS exposes how uncertainty and calibration jointly evolve, revealing structural degradation patterns overlooked by accuracy alone.
  • Figure 5: Illustration of robustness analysis of DeepSeek-VL2-Small across corruptions. The model exhibits distinct behavioral patterns. (a) Clean image as reference; (b) Defocus blur causes overconfidence and a large RAS drop; (c) Brightness adjustment improves both calibration and confidence, leading to a positive RAS; (d) Spatter leads to a correct yet hesitant prediction. The RAS summarizes the shifts in uncertainty ($\Delta \mathcal{S}$) and calibration error ($\Delta \mathcal{C}$), revealing a distinct robustness landscape even when accuracy is unchanged.
  • ...and 5 more figures
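
The four cases listed in the Figure 3 caption amount to a sign test on the pair $(\Delta \mathcal{S}, \Delta \mathcal{C})$. The sketch below illustrates that mapping only; it does not compute RAS, whose formula is not given in this summary. The function name `behavior_type` is hypothetical, and treating a shift of exactly zero as "not increasing" is an assumption made here for simplicity.

```python
def behavior_type(delta_s, delta_c):
    """Map shifts in uncertainty (delta_s) and calibration error (delta_c)
    to the four behavioral cases named in the Figure 3 caption.

    Assumption: a shift of exactly zero counts as "not increasing".
    """
    uncertainty_up = delta_s > 0
    calibration_error_up = delta_c > 0
    if uncertainty_up and calibration_error_up:
        return "Degradation"               # Case 1
    if calibration_error_up:
        return "Erroneous Overconfidence"  # Case 2: uncertainty falls, error grows
    if uncertainty_up:
        return "Hesitation"                # Case 3: uncertainty grows, error falls
    return "Stable"                        # Case 4


# Illustrative values only, not results from the paper.
for ds, dc in [(0.3, 0.2), (-0.4, 0.5), (0.6, -0.1), (-0.2, -0.3)]:
    print(ds, dc, behavior_type(ds, dc))
```

A per-sample label of this kind is what the scatter plots in the $(\Delta \mathcal{S}, \Delta \mathcal{C})$ plane visualize: two samples with identical correctness can fall into different quadrants.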