Visual Robustness Benchmark for Visual Question Answering (VQA)

Md Farhan Ishmam, Ishmam Tashdeed, Talukder Asir Saadat, Md Hamjajul Ashmafee, Abu Raihan Mostofa Kamal, Md. Azam Hossain

TL;DR

This work presents the first large-scale benchmark for evaluating the visual robustness of VQA models, including multimodal large language models in zero-shot settings, and for assessing the strength of realistic corruption effects.

Abstract

Can Visual Question Answering (VQA) systems perform just as well when deployed in the real world? Or are they susceptible to realistic corruption effects, such as image blur, which can be detrimental in sensitive applications like medical VQA? While linguistic or textual robustness has been thoroughly explored in the VQA literature, there has yet to be any significant work on the visual robustness of VQA models. We propose the first large-scale benchmark, comprising 213,000 augmented images, that challenges the visual robustness of multiple VQA models and assesses the strength of realistic visual corruptions. Additionally, we have designed several robustness evaluation metrics that can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal several insights into the relationships between model size, performance, and robustness to visual corruptions. Our benchmark highlights the need for a balanced approach in model development that pursues model performance without compromising robustness.

Paper Structure

This paper contains 43 sections, 21 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: Architecture of the Visual Robustness Framework and its components. Model Repository: hosts multiple VQA models for inference. Generator: applies the corruption functions to the VQA dataset and generates multiple augmented datasets. Inference Module: runs inference on the augmented datasets using models from the model repository. Robustness Evaluation Module: evaluates the results of the inference module by computing the robustness metrics introduced in this work. Visualization Module: produces visualizations based on the predicted answers. The VQA datasets, models, and corruptions are the inputs to the framework, while the VRE scores, accuracy scores, and visualizations are its outputs.
  • Figure 2: Effects of the $14$ visual corruption functions at the highest severity level $5$ introduced in our work. The functions are categorized into $6$ corruption classes.
  • Figure 3: Visualization of the relative accuracy drop $A^{rel}_{v,c}$ across models and visual corruptions. Significantly higher drops are observed for PNP (Tiong et al., 2022), shot noise, and zoom blur than for other models and corruptions, while the brightness corruption induces a negligible accuracy drop for the standard VQA models.
  • Figure 4: Average error $\mu_{v,c}$ of various models and visual corruption effects. The average error is the complement of average accuracy and can serve as a simple performance indicator for assessing model robustness.
  • Figure 5: (a)-(f) A set of images with increasing severity level when exposed to the splatter effect. (g)-(l) The corresponding Grad-CAM explanations (Selvaraju et al., 2017) employed by PNP (Tiong et al., 2022), showing variations in region localization. (m)-(r) The top 3 predictions and their associated probabilities, highlighting the inconsistencies in the corresponding caption predictions due to the visual corruption.
  • ...and 6 more figures
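
The captions for Figures 3 and 4 refer to two per-model, per-corruption quantities: the average error $\mu_{v,c}$, described as the complement of average accuracy, and the relative accuracy drop $A^{rel}_{v,c}$. The sketch below illustrates how such quantities might be computed. It rests on assumptions that this page does not confirm: that accuracies are averaged over the severity levels, and that the relative drop is the fraction of clean accuracy lost under corruption. The function names and the accuracy values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_error(accuracies):
    """Complement of the mean accuracy (accuracies given in [0, 1]).

    Assumption: the mean is taken over severity levels for one
    model/corruption pair.
    """
    return 1.0 - mean(accuracies)


def relative_accuracy_drop(clean_accuracy, corrupted_accuracy):
    """Fraction of clean accuracy lost under corruption.

    Assumption: this normalization is illustrative; the paper's exact
    definition of the relative accuracy drop is not shown on this page.
    """
    if clean_accuracy <= 0:
        raise ValueError("clean accuracy must be positive")
    return (clean_accuracy - corrupted_accuracy) / clean_accuracy


# Made-up accuracies for one hypothetical model under one corruption,
# at severity levels 1 through 5.
severity_accuracies = [0.70, 0.65, 0.60, 0.50, 0.45]
clean_accuracy = 0.72  # made-up accuracy on the uncorrupted images

mu = average_error(severity_accuracies)
a_rel = relative_accuracy_drop(clean_accuracy, mean(severity_accuracies))
print(f"average error: {mu:.3f}, relative accuracy drop: {a_rel:.3f}")
```

Under these assumptions, a corruption that barely affects a model (as the Figure 3 caption reports for brightness) would yield a relative drop near zero, while a highly damaging one would approach one.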