Knowledge-Based Counterfactual Queries for Visual Question Answering

Theodoti Stoikou, Maria Lymperaiou, Giorgos Stamou

TL;DR

This work tackles the interpretability and robustness challenges in Visual Question Answering by introducing knowledge-based counterfactual queries that perturb the linguistic input. It leverages structured knowledge sources (WordNet and color hierarchies) to generate minimal, linguistically feasible substitutions, enabling model-agnostic probing of VQA systems. The framework yields both local explanations (per-question behavior) and global rules (patterns across the dataset) that reveal biases and weaknesses in reasoning, demonstrated on VQA-v2 and Visual Genome with ViLT as proof-of-concept. Results show notable accuracy declines under counterfactuals, particularly for color and semantic substitutions, underscoring the method's utility for diagnosing robustness and guiding explainability improvements in visiolinguistic models.
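As a rough illustration of what a minimal versus maximal knowledge-based substitution might look like, the sketch below swaps a color word in a question for its nearest or farthest neighbor in a small hierarchy. The hierarchy, the `PARENT` and `LEAVES` tables, and the `distance` and `substitute` helpers are all hypothetical stand-ins invented for this example; they are not the paper's actual color hierarchy, WordNet, or implementation.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy (child -> parent). An assumption for illustration,
# standing in for a real knowledge source such as WordNet or a color hierarchy.
PARENT: Dict[str, Optional[str]] = {
    "color": None,
    "warm": "color",
    "cool": "color",
    "red": "warm",
    "orange": "warm",
    "blue": "cool",
    "green": "cool",
}
LEAVES: List[str] = ["red", "orange", "blue", "green"]


def path_to_root(node: str) -> List[str]:
    """Return the chain of nodes from `node` up to the root."""
    path = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        path.append(node)
    return path


def distance(a: str, b: str) -> int:
    """Number of edges between two concepts via their lowest common ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {name: i for i, name in enumerate(path_b)}
    for i, name in enumerate(path_a):
        if name in depth_in_b:
            return i + depth_in_b[name]
    raise ValueError("no common ancestor")


def substitute(question: str, mode: str = "minimal") -> Optional[str]:
    """Replace the first known color word with its closest ("minimal") or
    farthest ("maximal") neighbor in the toy hierarchy; None if no color word."""
    tokens = question.split()
    for idx, tok in enumerate(tokens):
        core = tok.strip("?.,")
        word = core.lower()
        if word in LEAVES:
            # Sorting the candidates makes tie-breaking deterministic.
            candidates = sorted(c for c in LEAVES if c != word)
            pick = min if mode == "minimal" else max
            replacement = pick(candidates, key=lambda c: distance(word, c))
            tokens[idx] = tok.replace(core, replacement)
            return " ".join(tokens)
    return None


minimal_q = substitute("Is the red car parked?", "minimal")
maximal_q = substitute("Is the red car parked?", "maximal")
```

Because candidates are ranked by a fixed distance over a fixed hierarchy, the same question always yields the same counterfactual, which is one way to read the "deterministic" and "controllable" properties the summary emphasizes.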

Abstract

Visual Question Answering (VQA) is a popular task that combines vision and language, with numerous relevant implementations in the literature. Although some attempts have addressed explainability and robustness issues in VQA models, very few of them employ counterfactuals as a means of probing such challenges in a model-agnostic way. In this work, we propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations. To this end, we exploit structured knowledge bases to perform deterministic, optimal and controllable word-level replacements targeting the linguistic modality, and we then evaluate the model's response to such counterfactual inputs. Finally, we qualitatively extract local and global explanations based on counterfactual responses, which prove insightful for interpreting VQA model behaviors. By performing a variety of perturbation types, targeting different parts of speech of the input question, we gain insights into the reasoning of the model by comparing its responses under different adversarial circumstances. Overall, we reveal possible biases in the decision-making process of the model, as well as expected and unexpected patterns, which impact its performance quantitatively and qualitatively, as indicated by our analysis.
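The comparison of responses before and after a perturbation can be pictured with a small sketch. It assumes a black-box `model` callable that maps a question to an answer (the image input is omitted for brevity), and it treats "local explanations" as per-question answer pairs and "global rules" as counts of recurring (substitution, answer before, answer after) patterns. These definitions, the `explain` helper, and the `toy_model` lookup are assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (original question, counterfactual question, substitution label)
Record = Tuple[str, str, str]


def explain(model: Callable[[str], str], records: List[Record]):
    """Query the model on each original/counterfactual pair and summarize.

    Returns per-question answer pairs (local view), counts of
    (substitution, answer_before, answer_after) patterns (global view),
    and the fraction of pairs whose answer changed.
    """
    local = []
    rules: Counter = Counter()
    for original, counterfactual, label in records:
        before, after = model(original), model(counterfactual)
        local.append((original, counterfactual, before, after))
        rules[(label, before, after)] += 1
    flip_rate = sum(b != a for _, _, b, a in local) / len(local)
    return local, rules, flip_rate


# Hypothetical stand-in for a VQA model: a fixed lookup table of answers.
ANSWERS = {
    "Is the car red?": "yes",
    "Is the car orange?": "yes",
    "Is the car blue?": "no",
}


def toy_model(question: str) -> str:
    return ANSWERS[question]


records = [
    ("Is the car red?", "Is the car orange?", "red->orange"),
    ("Is the car red?", "Is the car blue?", "red->blue"),
]
local, rules, flip_rate = explain(toy_model, records)
```

Since the model is only ever called as a function of its input, the same loop applies to any VQA system, which matches the model-agnostic framing of the abstract.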

Paper Structure (18 sections, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Example of an image and free-form questions retrieved from the Visual Genome dataset, targeted at the VQA task. The displayed answers are the responses given by the ViLT model.
  • Figure 2: Overview of our proposed knowledge-based counterfactual VQA framework.
  • Figure 3: Local explanations for Color Maximal counterfactual perturbations.
  • Figure 4: Local explanations for Color Minimal counterfactual perturbations.
  • Figure 5: Local explanations for Synonym Adjectives counterfactual perturbations.
  • ...and 5 more figures