Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering

Junnan Dong; Qinggang Zhang; Huachi Zhou; Daochen Zha; Pai Zheng; Xiao Huang

TL;DR

MAIL, a modality-aware integration with LLMs for KVQA, leverages multimodal knowledge for both image understanding and knowledge reasoning. It preserves intra-modal learning by constraining cross-modal fusion to the mediums, i.e., the mentioned entities shared by the two graphs.

Abstract

Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been made to leverage large language models (LLMs) as an implicit knowledge source, this remains challenging since LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g., images, KGs and LLMs, cannot be readily aligned for complex scenarios. To tackle these challenges, we present a novel modality-aware integration with LLMs for KVQA (MAIL). It carefully leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely represent the image as a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; and (iii) we design a tailored pseudo-siamese graph medium fusion (PS-GMF) for sufficient multimodal fusion. We use the mentioned entities shared by the two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion within mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL with 24x fewer resources.
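The medium idea in step (iii) can be illustrated with a toy sketch. This is not the paper's PS-GMF, whose architecture the abstract does not detail. It is a minimal illustration under stated assumptions: both graphs are lists of (head, relation, tail) triples, node embeddings are plain lists of floats, and the cross-modal exchange is a simple linear interpolation controlled by a hypothetical weight `alpha`. The function names `find_mediums` and `fuse_on_mediums` are invented for this example.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]          # (head, relation, tail)
Embeddings = Dict[str, List[float]]    # node name -> toy embedding


def graph_nodes(triples: List[Triple]) -> Set[str]:
    """Collect every entity that appears as a head or a tail."""
    return {h for h, _, _ in triples} | {t for _, _, t in triples}


def find_mediums(scene: List[Triple], concept: List[Triple]) -> Set[str]:
    """Mediums are the entities mentioned in both graphs."""
    return graph_nodes(scene) & graph_nodes(concept)


def fuse_on_mediums(
    scene_emb: Embeddings,
    concept_emb: Embeddings,
    mediums: Set[str],
    alpha: float = 0.5,  # assumed mixing weight, not taken from the paper
) -> Tuple[Embeddings, Embeddings]:
    """Exchange information only at medium nodes.

    Non-medium nodes are copied unchanged, so each graph keeps its own
    intra-modal representation outside the shared entities.
    """
    fused_scene = dict(scene_emb)
    fused_concept = dict(concept_emb)
    for m in mediums:
        s, c = scene_emb[m], concept_emb[m]
        fused_scene[m] = [(1 - alpha) * a + alpha * b for a, b in zip(s, c)]
        fused_concept[m] = [(1 - alpha) * b + alpha * a for a, b in zip(s, c)]
    return fused_scene, fused_concept


# Toy example: a scene graph from the image and a concept graph from a KG.
scene_graph = [("man", "holding", "umbrella"), ("umbrella", "color", "red")]
concept_graph = [("umbrella", "UsedFor", "rain protection"), ("man", "IsA", "person")]

mediums = find_mediums(scene_graph, concept_graph)

scene_emb = {"man": [1.0, 0.0], "umbrella": [0.0, 1.0], "red": [1.0, 1.0]}
concept_emb = {
    "man": [0.0, 0.0],
    "umbrella": [1.0, 1.0],
    "rain protection": [0.0, 1.0],
    "person": [1.0, 0.0],
}
fused_scene, fused_concept = fuse_on_mediums(scene_emb, concept_emb, mediums)
```

In this sketch only "man" and "umbrella" exchange information across graphs, while "red", "person" and "rain protection" keep their original embeddings, which mirrors the stated goal of constraining fusion within mediums.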

Paper Structure (21 sections, 11 equations, 3 figures, 9 tables)

Introduction
Problem Statement
Methodology
Scene Graph Construction
Concept Graph Construction
Pseudo-siamese Graph Medium Fusion
Training Objective
Joint Optimization
Experiments
Experimental Setup
Main Results
Hyperparameter Analysis
Ablation Studies
Case Studies
Related Work
...and 6 more sections

Figures (3)

Figure 1: A sketched comparison of how existing learning paradigms and ours employ LLMs for KVQA.
Figure 2: Our proposed framework MAIL, a novel modality-aware integration for knowledge-based VQA with LLMs. Nodes in blue stand for external knowledge, while red is for visual objects and yellow shows the topic entities from questions. Blue nodes with red dashed borders indicate the extracted mediums in concept graph. MAIL is trained to integrate multimodal information for comprehensive cross-modal reasoning with a tailored PS-GMF.
Figure 3: Case studies with both single-hop and multi-hop reasoning examples in OK-VQA.
