Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

Manas Jhalani; Annervaz K M; Pushpak Bhattacharyya

TL;DR

This work tackles Knowledge-Based Visual Question Answering (KBVQA) by dynamically extracting external knowledge from knowledge graphs to accompany images and questions. It introduces a dynamic triple-filtering module that selects a variable number of 2-hop triples based on similarity and feeds them into an OFA vision-language transformer for answer generation. Empirical results show an average improvement of about 4.75% across KVQA, FVQA, and CRIC-VQA. Ablations confirm the advantage of dynamically selected triples over a fixed number of context triples, and further experiments demonstrate cross-domain generalization and a benefit to large multimodal LLMs such as LLAVA. The approach reduces noise in the supplied context and improves reasoning, and the authors outline end-to-end training and explainability as avenues for future work.

Abstract

In the realm of multimodal tasks, Visual Question Answering (VQA) plays a crucial role by addressing natural language questions grounded in visual content. Knowledge-Based Visual Question Answering (KBVQA) advances this concept by adding external knowledge along with images to respond to questions. We introduce an approach for KBVQA, augmenting the existing vision-language transformer encoder-decoder (OFA) model. Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction method. We supply a flexible number of triples from the knowledge graph as context, tailored to meet the requirements for answering the question. Our model, enriched with knowledge, demonstrates an average improvement of 4.75% in Exact Match Score over the state-of-the-art on three different KBVQA datasets. Through experiments and analysis, we demonstrate that furnishing variable triples for each question improves the reasoning capabilities of the language model in contrast to supplying a fixed number of triples. This is illustrated even for recent large language models. Additionally, we highlight the model's generalization capability by showcasing its SOTA-beating performance on a small dataset, achieved through straightforward fine-tuning.
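
The idea of supplying a variable rather than fixed number of triples can be illustrated with a small sketch. This is not the authors' implementation: the bag-of-words similarity, the `threshold` value, the function names, and the toy triples are all assumptions made for illustration, standing in for whatever learned similarity the paper actually uses.

```python
import math
from collections import Counter

def bow(text):
    """Toy bag-of-words vector; an assumed stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter-based sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def filter_triples(question, triples, threshold=0.2):
    """Keep every triple whose similarity to the question reaches the threshold.

    The number of returned triples varies per question, in contrast to a
    fixed top-k selection. The threshold is a hypothetical hyperparameter.
    """
    q = bow(question)
    scored = [(cosine(q, bow(" ".join(t))), t) for t in triples]
    kept = sorted((st for st in scored if st[0] >= threshold),
                  key=lambda st: st[0], reverse=True)
    return [t for _, t in kept]

# Illustrative (subject, relation, object) triples, not taken from the paper.
TRIPLES = [
    ("Eiffel Tower", "located in", "Paris"),
    ("Eiffel Tower", "height", "330 m"),
    ("Paris", "capital of", "France"),
]

located = filter_triples("Where is the Eiffel Tower located", TRIPLES)
capital = filter_triples("What is the capital of France", TRIPLES)
print(len(located), len(capital))
```

With these toy inputs the first question retains two triples and the second retains one, so the amount of context passed to the answer generator depends on the question instead of being padded to a fixed size with less relevant triples.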

Paper Structure (29 sections, 4 equations, 3 figures, 16 tables, 1 algorithm)

Introduction
Related Work
Datasets:
Our Approach
Triple Filtering Module
Triples Relevant to Entities in Image
Triples Relevant to Entities in Question
Prediction Module
Experimental Setup & Results
Results on KVQA dataset
Ablation Results
Results on CRIC-VQA dataset
Generalisation Capability
Relevance of Knowledge in the Context of MLLMs
Zero-shot Evaluation on the LLAVA model
...and 14 more sections

Figures (3)

Figure 1: Example question answerable solely from an image (Shah et al., 2019), without requiring external information. Question: Who is to the right of R. Madhavan? Named Entities: [Kangana Ranaut, R. Madhavan]
Figure 2: Flow diagram of the proposed framework. In the first stage, triples are filtered based on the image and then filtered again based on the question. Among the extracted triples, those in green are useful and those in red are noisy. In the second stage, the relevant triples, the image ResNet features, and the question are fed into a transformer encoder-decoder model (OFA) to generate the predicted answer. $\oplus$ denotes the concatenation of all features before they are passed to the transformer encoder-decoder. Irrelevant triples are depicted with dashed lines, while relevant triples, filtered based on the image, are drawn with bold lines.
Figure 3: Splitting the image into four patches to extract relevant triples.
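
The four-patch split in Figure 3 can be sketched as follows. The caption does not specify the patch layout, so the even 2x2 grid, the function name `quadrant_boxes`, and the `(left, top, right, bottom)` box convention are assumptions made for illustration only.

```python
def quadrant_boxes(width, height):
    """Return four (left, top, right, bottom) boxes covering a 2x2 grid.

    Assumes an even split at the image midpoint; the boxes tile the image
    without overlap, so each pixel falls in exactly one patch.
    """
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),           # top-left
        (mid_x, 0, width, mid_y),       # top-right
        (0, mid_y, mid_x, height),      # bottom-left
        (mid_x, mid_y, width, height),  # bottom-right
    ]

boxes = quadrant_boxes(640, 480)
for box in boxes:
    print(box)
```

Each box could then be cropped from the image and matched against candidate triples separately, which is one plausible reading of how patches are used to extract relevant triples.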
