VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, Jure Leskovec

TL;DR

VQA-GNN addresses a limitation of prior VQA methods, which fuse unstructured QA context into structured multimodal knowledge in one direction only, by introducing bidirectional fusion over a multimodal semantic graph. It interconnects the scene graph and the concept graph via QA-context and QA-concept nodes and employs modality-specific GNNs to perform inter-modal message passing, enabling deeper concept-level reasoning. It outperforms strong baselines by 3.2% on VCR (Q-AR) and 4.6% on GQA, and ablations support the two core ideas: bidirectional fusion and multimodal GNNs. The results suggest that jointly reasoning over unstructured and structured knowledge can reduce reliance on large-scale pretraining while enhancing reasoning capabilities for VQA.

Abstract

Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge.

Paper Structure (14 sections, 11 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of VQA-GNN. Given an image and QA sentence, we obtain unstructured knowledge (e.g., QA-concept node p and QA-context node z) and structured knowledge (e.g., scene-graph and concept-graph), and then unify them to perform bidirectional fusion for visual question answering.
  • Figure 2: Reasoning procedure of VQA-GNN. We first build a multimodal semantic graph for each given image-QA pair to unify unstructured (e.g., "node p" and "node z") and structured (e.g., "scene-graph" and "concept-graph") multimodal knowledge (§\ref{semantic-graph}). Then we perform inter-modal message passing with a multimodal GNN-based bidirectional fusion method (§\ref{gnn}) to update the representations of nodes $z$, $p$, $v_i$ and $c_i$ for $k+1$ iterations in two steps. Finally, we predict the answer from the updated node representations (§\ref{infer}). Here, "S" and "C" indicate the scene-graph and the concept-graph, respectively. "LM_encoder" indicates a language model used to finetune the QA-context node representation, and "GNN" indicates a relation-graph neural network for iterative message passing.
  • Figure 3: Concept-graph retrieval. The process computes the similarity between concept-graph nodes and the answer context, denoted as $\mathrm{Relev}(e \mid a)$.
  • Figure 4: Ablation architectures. We find that our final VQA-GNN architecture with two modality-specialized GNNs overcomes the representation gaps between modalities (§\ref{para:modality_gap}).
  • Figure 5: Illustration of two knowledge fusion methods: our proposed bidirectional fusion vs. the unidirectional fusion baseline.
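
The graph construction and bidirectional fusion described in the captions for Figures 1, 2 and 5 can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses a relation-graph neural network and a language-model encoder, whereas the sketch below substitutes plain mean aggregation, and every name (`build_subgraph`, `message_pass`, the 2-d features, the averaging of the two `z` states) is a hypothetical chosen for illustration. It assumes only what the text states: a QA-context node `z` links the scene graph and the concept graph, and each modality is processed by its own message-passing step.

```python
from typing import Dict, List, Tuple

Vec = List[float]
Edge = Tuple[str, str]


def mean(vectors: List[Vec]) -> Vec:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_subgraph(z: Vec, nodes: Dict[str, Vec],
                   inner_edges: List[Edge]) -> Tuple[Dict[str, Vec], List[Edge]]:
    """Attach the QA-context node "z" to every node of one modality graph.

    Assumption: z is connected to all nodes; the paper's actual edge set
    and relation types are not reproduced here.
    """
    feats = {"z": z, **nodes}
    edges = list(inner_edges) + [("z", name) for name in nodes]
    return feats, edges


def message_pass(feats: Dict[str, Vec], edges: List[Edge]) -> Dict[str, Vec]:
    """One round of mean aggregation over undirected edges (self included).

    Stand-in for the relation-aware GNN layer. Because edges are treated as
    undirected, z receives messages from graph nodes and graph nodes receive
    messages from z in the same round, i.e. the fusion is bidirectional.
    """
    neighbours = {name: [name] for name in feats}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return {name: mean([feats[m] for m in neighbours[name]]) for name in feats}


# Toy 2-d features (hypothetical values).
z0 = [1.0, 1.0]                                  # QA-context node z
scene = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}     # scene-graph nodes v_i
concept = {"c1": [2.0, 2.0], "c2": [4.0, 0.0]}   # concept-graph nodes c_i

# One subgraph per modality, each containing its own copy of z.
s_feats, s_edges = build_subgraph(z0, scene, [("v1", "v2")])
c_feats, c_edges = build_subgraph(z0, concept, [("c1", "c2")])

# Modality-specific message passing (separate step per modality).
s_out = message_pass(s_feats, s_edges)
c_out = message_pass(c_feats, c_edges)

# Merge the two views of z; averaging is an assumption made for this sketch.
z_fused = mean([s_out["z"], c_out["z"]])
print(z_fused)
```

Running each modality through its own update before merging the `z` states mirrors, in miniature, the idea of modality-specialized GNNs from Figure 4. A unidirectional baseline in the sense of Figure 5 would instead update only the graph nodes from `z` and leave `z` unchanged.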