Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering

Xingyu Fu; Ben Zhou; Sihao Chen; Mark Yatskar; Dan Roth

TL;DR

This paper tackles the interpretability gap in visual question answering by introducing Dynamic Clue Bottleneck (dCluB), an interpretable-by-design model that first generates human-readable visual clues and then predicts the answer solely from these clues. The core idea is to factor the prediction into a bottleneck $g(x)$ that produces clues and a final predictor $f$ that uses natural language inference to evaluate candidate answers, so that explanations are faithful from the outset. A dataset of 1.7k reasoning-focused questions annotated with visual clues supports training and evaluation, and results show that dCluB improves reasoning-focused accuracy by 4.64% while maintaining near-parity with black-box baselines on VQA-v2 and GQA. This approach advances trustworthy, interpretable multimodal reasoning with practical implications for sensitive domains.

Abstract

Recent advances in multimodal large language models (LLMs) have proven highly effective in visual question answering (VQA). However, the end-to-end design of these models prevents them from being interpretable to humans, undermining trust and applicability in critical domains. While post-hoc rationales offer some insight into model behavior, these explanations are not guaranteed to be faithful to the model. In this paper, we address these shortcomings by introducing an interpretable-by-design model that factors model decisions into intermediate human-legible explanations and allows people to easily understand why a model fails or succeeds. We propose the Dynamic Clue Bottleneck Model (dCluB), a method designed towards an inherently interpretable VQA system. dCluB provides an explainable intermediate space before the VQA decision and is faithful from the beginning, while maintaining comparable performance to black-box systems. Given a question, dCluB first returns a set of visual clues, which are natural language statements of visually salient evidence from the image, and then generates the output based solely on the visual clues. To supervise and evaluate the generation of VQA explanations within dCluB, we collect a dataset of 1.7k reasoning-focused questions with visual clues. Evaluations show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions while preserving 99.43% of performance on VQA-v2.
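The two-stage design described above (clues first, then an answer chosen from the clues alone) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not the authors' implementation: `toy_entailment_score` is a crude token-overlap stand-in for a real natural language inference model, `predict_from_clues` and the hypothesis template are illustrative, and joining the clues into a single premise is an assumed aggregation choice. The one property the sketch is meant to show is that the final predictor never sees the image.

```python
from typing import Callable, List, Tuple


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model (assumption, not the paper's scorer).

    Returns the fraction of hypothesis tokens that also occur in the premise.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(1 for tok in hypothesis_tokens if tok in premise_tokens)
    return hits / len(hypothesis_tokens)


def predict_from_clues(
    clues: List[str],
    candidates: List[str],
    make_hypothesis: Callable[[str], str],
    score: Callable[[str, str], float] = toy_entailment_score,
) -> Tuple[str, float]:
    """Pick the candidate whose hypothesis is best supported by the clues.

    The image is deliberately not an argument: the clues are the bottleneck.
    """
    if not candidates:
        raise ValueError("at least one answer candidate is required")
    # Assumed aggregation: treat all clues together as one premise.
    premise = " ".join(clues)
    scored = [(score(premise, make_hypothesis(c)), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


# Hypothetical clues, as a clue generator might produce for the question
# "What is the man holding?"
example_clues = ["a man is holding an umbrella", "the ground is wet"]
example_candidates = ["umbrella", "bat"]
answer, confidence = predict_from_clues(
    example_clues,
    example_candidates,
    make_hypothesis=lambda a: f"the man is holding {a}",
)
print(answer, confidence)
```

Because the predictor only receives text, a wrong answer can be traced either to a missing or incorrect clue or to a faulty entailment judgment, which is the kind of failure analysis the bottleneck design is meant to enable.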

Paper Structure (18 sections, 3 equations, 11 figures, 3 tables)

Introduction
Related Works
VQA Interpretability
Textual Interpretability
Dynamic Clue Bottlenecks VQA
Visual Clue Generator $g(x)$
Final Predictor $f$
Visual Clues Dataset Collection
Experiments
Blackbox Counterpart Baselines
Main Results
When does dCluB Succeed or Fail?
Conclusion
Limitations
Appendix
...and 3 more sections

Figures (11)

Figure 1: Design differences between de facto black-box VQA methods (top) and our proposed dCluB method (bottom). Default models directly generate answers, while dCluB first provides visual clues in the image that could hint at an answer and then decides the answer based solely on the clues.
Figure 2: A detailed illustration of our dCluB system on an example VQA instance, with explicit steps of visual clue generation using $g$ and natural language entailment scores from $f$ for the final prediction. Answer candidates are given in advance in our setting; in our experiments, we use the top-k answers from the counterpart black-box model as answer candidates.
Figure 3: Examples showing the dynamic nature of dCluB's visual clues: for the same image, different questions should have different visual clues.
Figure 4: Collected visual clues examples in our training data.
Figure 5: Qualitative example outputs from dCluB. Human-annotated visual clues are in the grey boxes under the human icon, and dCluB visual clues are in colored boxes.
...and 6 more figures
