Analysing the Robustness of Vision-Language-Models to Common Corruptions
Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, Umair Bin Mansoor
TL;DR
This study analyzes the robustness of vision-language models to common image corruptions, evaluating 19 corruption types from ImageNet-C across four categories: noise, blur, weather, and digital. It introduces two corruption-augmented benchmarks, TextVQA-C and GQA-C, and uses the LLaVA-1.5 model to uncover task-specific vulnerability patterns. A frequency-domain perspective shows that transformers’ bias toward low-frequency processing explains why text understanding and object reasoning degrade differently under different corruptions. The findings offer practical guidance for designing corruption-robust vision-language architectures for real-world deployment, particularly in OCR-enabled text reasoning and scene understanding tasks.
Abstract
Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
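To make the frequency-domain framing concrete, the following minimal sketch (not the authors' analysis code) compares how much spectral energy sits at high spatial frequencies for a clean, a noisy, and a blurred image. Everything in it is an illustrative assumption: the synthetic test image, the helper names `high_freq_energy_fraction` and `box_blur`, the cutoff of 0.25 cycles per pixel, and the use of additive Gaussian noise and a box blur as simple stand-ins for the ImageNet-C corruptions.

```python
import numpy as np


def high_freq_energy_fraction(image, cutoff=0.25):
    """Fraction of spectral energy at radial frequency above `cutoff` (cycles/pixel)."""
    image = np.asarray(image, dtype=np.float64)
    image = image - image.mean()  # remove the DC component
    power = np.abs(np.fft.fft2(image)) ** 2
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)


def box_blur(image, k=5):
    """k x k box blur with edge padding (hypothetical stand-in for a blur corruption)."""
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (k * k)


# Synthetic grayscale image: a coarse sinusoid plus a fine texture.
y, x = np.mgrid[0:64, 0:64]
clean = (0.5
         + 0.3 * np.sin(2 * np.pi * x / 32)
         + 0.1 * np.sin(2 * np.pi * x / 4) * np.sin(2 * np.pi * y / 4))

rng = np.random.default_rng(0)
noisy = clean + rng.normal(0.0, 0.2, clean.shape)  # noise-like corruption
blurred = box_blur(clean, k=5)                     # blur-like corruption

frac_clean = high_freq_energy_fraction(clean)
frac_noisy = high_freq_energy_fraction(noisy)
frac_blurred = high_freq_energy_fraction(blurred)
print(f"clean={frac_clean:.4f} noisy={frac_noisy:.4f} blurred={frac_blurred:.4f}")
```

In this toy setting the noise raises the high-frequency share while the blur suppresses it, which is the kind of contrast the frequency-domain argument relies on. Real ImageNet-C corruptions and natural images would need to be measured directly.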
