Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering

Pascal Tilli, Ngoc Thang Vu

TL;DR

This work addresses the need for intrinsically interpretable graph-based visual question answering (GVQA) by using discretely sampled subgraphs as explanations. It integrates four sampling methods (AIMLE, IMLE, SIMPLE, and Gumbel SoftSub-ST) into a GVQA system that uses CLIP-based embeddings and a fixed subgraph size, and it evaluates them on GQA using answer accuracy and two token co-occurrence metrics (At-coo for answer tokens and Qt-coo for question tokens), complemented by a human study. The results show that AIMLE and SIMPLE achieve strong accuracy with high explanatory co-occurrences, while Gumbel SoftSub-ST underperforms unless carefully tuned. Human preferences align with the At-coo/Qt-coo rankings, which supports using these metrics as proxies for interpretability. Overall, the paper provides a principled comparison and practical guidance for selecting intrinsic subgraph sampling methods that balance interpretability and predictive performance in multimodal reasoning tasks.

Abstract

Explainable artificial intelligence (XAI) aims to make machine learning models more transparent. While many approaches focus on generating explanations post-hoc, interpretable approaches, which generate the explanations intrinsically alongside the predictions, are relatively rare. In this work, we integrate different discrete subset sampling methods into a graph-based visual question answering system to compare their effectiveness in generating interpretable explanatory subgraphs intrinsically. We evaluate the methods on the GQA dataset and show that the integrated methods effectively mitigate the performance trade-off between interpretability and answer accuracy, while also achieving strong co-occurrences between answer and question tokens. Furthermore, we conduct a human evaluation to assess the interpretability of the generated subgraphs using a comparative setting with the extended Bradley-Terry model, showing that the answer and question token co-occurrence metrics strongly correlate with human preferences. Our source code is publicly available.
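The abstract says that human preferences were collected in a comparative setting and analysed with the extended Bradley-Terry model. The extension itself is not described here, so the following is only a minimal sketch of the plain Bradley-Terry model, fitted with the standard minorization-maximization update. The function name `fit_bradley_terry` and the pairwise counts are hypothetical and are not taken from the paper.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise preference counts.

    wins[i][j] is the number of times item i was preferred over item j.
    Assumes every item wins at least once; otherwise its strength
    collapses to zero and the plain model is not well defined.
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each comparison with j contributes n_ij / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strength[i])
        # Strengths are only identified up to scale, so normalise to sum to 1.
        norm = sum(updated)
        strength = [s / norm for s in updated]
    return strength


# Hypothetical counts for three sampling methods, 10 comparisons per pair.
example_wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
example_strengths = fit_bradley_terry(example_wins)
print(example_strengths)
```

Under this model, the probability that item i is preferred over item j is `strength[i] / (strength[i] + strength[j])`, so the fitted values give a ranking that can be compared against a metric-based ranking.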

Paper Structure

This paper contains 38 sections, 10 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The prior scores $\theta$, which are based on the question and node embeddings, are used to sample a subgraph $z$ that is then used to predict the answer.
  • Figure 2: Model accuracy with respect to batch size.
  • Figure 3: At-coo and Qt-coo values with respect to batch size.
  • Figure 4: Accuracy per method across different top-$k$ values and batch sizes.
  • Figure 5: Likert scale responses to the questions about and .
  • ...and 2 more figures
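
The caption of Figure 1 describes a pipeline in which prior scores $\theta$, computed from the question and node embeddings, are used to sample a subgraph $z$ of fixed size. As a rough illustration of that forward step only, the sketch below scores nodes with a dot product and draws a k-hot mask with the Gumbel top-k trick. The scoring function, the toy embeddings, and the names `prior_scores` and `sample_k_hot` are assumptions made for illustration. The gradient estimators that distinguish the compared methods are not shown.

```python
import math
import random


def prior_scores(question_emb, node_embs):
    """Hypothetical scoring: dot product between the question and each node."""
    return [sum(q * x for q, x in zip(question_emb, node)) for node in node_embs]


def sample_k_hot(theta, k, rng):
    """Draw a k-hot node mask by perturbing scores with Gumbel(0, 1) noise.

    Keeping the k largest perturbed scores samples k nodes without
    replacement, with higher-scoring nodes more likely to be kept.
    """
    perturbed = []
    for score in theta:
        # Clamp the uniform draw away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        perturbed.append(score - math.log(-math.log(u)))
    order = sorted(range(len(theta)), key=lambda i: perturbed[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(theta))]


# Toy 3-dimensional embeddings (made up for this example).
question_emb = [1.0, 0.0, 0.5]
node_embs = [
    [0.9, 0.1, 0.4],
    [0.0, 1.0, 0.0],
    [0.7, 0.0, 0.9],
    [-0.5, 0.2, 0.1],
]
theta = prior_scores(question_emb, node_embs)
z = sample_k_hot(theta, k=2, rng=random.Random(0))
print(theta, z)
```

In this sketch the mask `z` always contains exactly `k` ones, which mirrors the fixed subgraph size mentioned in the TL;DR. A downstream answer predictor would then see only the selected nodes.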