Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric
Xiangjie Sui, Songyang Li, Hanwei Zhu, Baoliang Chen, Yuming Fang, Xin Sun
TL;DR
The paper addresses the insufficiency of accuracy-only metrics for evaluating corruption robustness in LVLMs by introducing Bench-C, a discriminative benchmark emphasizing samples that reveal model differences under visual corruptions, and Robustness Alignment Score (RAS), which captures fine-grained shifts in prediction uncertainty and calibration. Bench-C combines 19 corruption types across 5 severities and a discriminative sampling strategy based on cross-model prediction variability and semantic diversity, yielding a focused testbed of 849 samples. RAS integrates uncertainty and calibration dynamics through a mathematically defined score that ranges from $-2$ to $1$, enabling a structural view of robustness beyond accuracy, including destructive and corrective robustness components. Across 13 LVLMs, the study finds distinct behavioral patterns under corruption, demonstrates that higher clean accuracy does not guarantee structural robustness, and reveals a stability–adaptability tension as models either preserve or recover predictive structure under degradation. The work thus provides a principled framework for diagnosing and improving corruption robustness in LVLMs with practical implications for reliable real-world deployment.
Abstract
Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, built with a selection strategy that jointly considers prediction inconsistency under corruption and semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruptions, such as erroneous confidence and hesitation; 2) although subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) decomposing corruption robustness into destructive and corrective components reveals distinct failure and recovery patterns across models.
