Multimodal Multihop Source Retrieval for Web Question Answering

Navya Yarrabelly; Saloni Mittal

TL;DR

The paper tackles multimodal multihop source retrieval for open-domain QA by introducing a Hierarchical Graph Network (HGN) that constructs and reasons over multimodal sources. It compares multiple graph topologies (star, fully connected, and entity-based hierarchical graphs) and employs GraphSAGE-based message passing with node/edge supervision plus a contrastive objective. Using CLIP and sBERT features, the approach achieves notable gains (e.g., ~4.6 percentage points in F1) over transformer baselines while being lighter and more scalable. The results highlight the value of structured graph priors for multimodal retrieval, while also identifying challenges in full-scale retrieval and the potential of graph attention architectures for further improvements.

Abstract

This work deals with the challenge of learning and reasoning over multi-modal multi-hop question answering (QA). We propose a graph reasoning network based on the semantic structure of the sentences to learn multi-source reasoning paths and find the supporting facts across both image and text modalities for answering the question. In this paper, we investigate the importance of graph structure for multi-modal multi-hop question answering. Our analysis is centered on WebQA. We construct a strong baseline model that finds relevant sources using a pairwise classification task. We establish that, with the proper use of feature representations from pre-trained models, graph structure helps in improving multi-modal multi-hop question answering. We point out that both graph structure and adjacency matrix are task-related prior knowledge, and that graph structure can be leveraged to improve retrieval performance for the task. Experiments and visualized analysis demonstrate that message propagation over graph networks or the entire graph structure can replace massive multimodal transformers with token-wise cross-attention. We demonstrate the applicability of our method and show a performance gain of 4.6% in retrieval F1 score over the transformer baselines, despite the model being very light. We further demonstrate the applicability of our model to a large-scale retrieval setting.
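To make the role of the adjacency structure concrete, here is a minimal sketch of the two simplest topologies named on this page (star and fully connected) and one GraphSAGE-style mean-aggregation step over them. It is an illustration under stated assumptions, not the authors' implementation: the function names (`star_graph`, `fully_connected_graph`, `mean_aggregate`, `num_edges`) are hypothetical, node 0 is assumed to be the question, and the learned weight matrix and nonlinearity of a real GraphSAGE layer are omitted.

```python
from typing import Dict, List, Set


def star_graph(num_sources: int) -> Dict[int, Set[int]]:
    """Assumed layout: node 0 is the question, nodes 1..num_sources are sources.
    Only question-source edges exist."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(num_sources + 1)}
    for s in range(1, num_sources + 1):
        adj[0].add(s)
        adj[s].add(0)
    return adj


def fully_connected_graph(num_sources: int) -> Dict[int, Set[int]]:
    """Star graph plus dense source-source edges, so every source can
    reach every other source in a single hop."""
    adj = star_graph(num_sources)
    for a in range(1, num_sources + 1):
        for b in range(1, num_sources + 1):
            if a != b:
                adj[a].add(b)
    return adj


def num_edges(adj: Dict[int, Set[int]]) -> int:
    """Count undirected edges (each edge is stored in both directions)."""
    return sum(len(neigh) for neigh in adj.values()) // 2


def mean_aggregate(features: List[List[float]],
                   adj: Dict[int, Set[int]]) -> List[List[float]]:
    """One simplified GraphSAGE-style step: concatenate each node's own
    features with the mean of its neighbours' features.
    A real layer would follow this with a learned projection and nonlinearity."""
    dim = len(features[0])
    updated = []
    for node in range(len(features)):
        neigh = sorted(adj[node])
        if neigh:
            mean = [sum(features[n][d] for n in neigh) / len(neigh)
                    for d in range(dim)]
        else:
            mean = [0.0] * dim
        updated.append(features[node] + mean)
    return updated
```

With the same node features, the two topologies give a source node different neighbourhood summaries: in the star graph a source sees only the question, whereas in the fully connected graph it also sees the other sources. In that sense the adjacency matrix acts as prior knowledge about the task.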

Paper Structure (35 sections, 5 equations, 6 figures, 5 tables)

Introduction
Related Work
Multimodal Visual Q/A
Multihop QA
Cross modality representations
Problem Statement
Baseline Models
VLP + VinVL
CLIP + Sentence-BERT based baseline
Proposed Approach
Methods
Graph Construction Module
Star Node Structure
Fully Connected Graph Structure
Hierarchical Semantic Graph Networks
...and 20 more sections

Figures (6)

Figure 1: Top: Sample query; Mid: Possible Sources; Bottom: Desired response
Figure 2: We disentangle the question node $Q$ in yellow. There are considerably fewer edges in this architecture. This example suggests positive text snippets containing relevant information.
Figure 3: Here we build upon the star-based architecture by adding dense source-source connections. Full connections between sources ensure that the information needed to make a decision for a node is available in a single hop.
Figure 4: Entity-based Hierarchical Graph Network. Yellow nodes represent questions, red nodes denote distractor sources, green nodes indicate positive sources, and grey nodes are the entity nodes for each source and question node.
Figure 5: The two plots show the dot-product distribution between question and source embeddings. The top plot shows the distribution for question-negative source pairs, and the bottom plot shows the distribution for question-positive source pairs. Blue histograms use similarities from pre-trained sBERT embeddings; orange histograms use graph node embeddings.
...and 1 more figure
