Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models

Wenbin An, Feng Tian, Jiahao Nie, Wenkai Shi, Haonan Lin, Yan Chen, QianYing Wang, Yaqiang Wu, Guang Dai, Ping Chen

TL;DR

DKA is a training-free, disentangled approach to knowledge-based VQA that uses LLM feedback to split a complex question into two sub-questions: an image-based sub-question that guides caption generation and a knowledge-based sub-question that guides external retrieval. The decoupled pipeline combines image-driven captions, targeted external knowledge, and in-context learning, then ensembles multiple outputs for more robust answers. Experiments on OK-VQA and AOK-VQA show state-of-the-art or competitive performance without additional training, with better alignment to the LLM's knowledge needs and less retrieval noise. The method emphasizes interpretability and modularity, so individual components can be improved without re-training the entire system. Limitations include dependence on the quality of the captioning model and reliance on external tools; future work may explore alternative LLMs, captioners, and retrievers to further improve accuracy.

Abstract

Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions. Current methods first retrieve knowledge from the image and an external knowledge base using the original complex question, then generate answers with Large Language Models (LLMs). However, since the original question contains complex elements that require knowledge from different sources, acquiring different kinds of knowledge in a coupled manner may confuse models and hinder them from retrieving precise knowledge. Furthermore, the "forward-only" answering process fails to explicitly capture the knowledge needs of LLMs, which can further hurt answering quality. To cope with these limitations, we propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion and uses LLM feedback to specify the required knowledge. Specifically, DKA requires LLMs to specify what knowledge they need to answer the question and to decompose the original complex question into two simple sub-questions: an image-based sub-question and a knowledge-based sub-question. We then use the two sub-questions to retrieve knowledge from the image and the knowledge base, respectively. In this way, the two knowledge acquisition models can each focus on the content that corresponds to them and avoid interference from irrelevant elements in the original complex question, which helps provide more precise knowledge and better align with the knowledge needs of LLMs to yield correct answers. Experiments on benchmark datasets show that DKA significantly outperforms SOTA models. To facilitate future research, our data and code are available at https://github.com/Lackel/DKA.
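
The control flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `SubQuestions`, `build_prompt`, `answer_with_dka`, and the four callables are hypothetical, the prompt layout is invented for readability, and the toy stand-ins at the bottom replace the real LLM, captioning model, and retriever. In-context examples and answer ensembling are left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubQuestions:
    image_based: str      # q_i: can be answered by looking at the image
    knowledge_based: str  # q_k: needs external world knowledge


def build_prompt(question: str, caption: str, facts: List[str]) -> str:
    """Assemble an answering prompt (hypothetical layout) from both knowledge sources."""
    knowledge = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"Caption: {caption}\n"
        f"Knowledge:\n{knowledge}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer_with_dka(
    question: str,
    image: object,
    decompose: Callable[[str], SubQuestions],
    caption_image: Callable[[object, str], str],
    retrieve_knowledge: Callable[[str], List[str]],
    generate_answer: Callable[[str], str],
) -> str:
    # Step 1 (LLM feedback): the LLM states what it needs by splitting the question.
    subs = decompose(question)
    # Step 2 (disentangled acquisition): each model sees only its own sub-question.
    caption = caption_image(image, subs.image_based)
    facts = retrieve_knowledge(subs.knowledge_based)
    # Step 3: the LLM answers the original question from the combined knowledge.
    return generate_answer(build_prompt(question, caption, facts))


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_decompose(question: str) -> SubQuestions:
    return SubQuestions("What animal is in the picture?", "What does this animal eat?")


def toy_caption(image: object, sub_question: str) -> str:
    return "A zebra standing in a field."


def toy_retrieve(sub_question: str) -> List[str]:
    return ["Zebras mainly eat grass."]


def toy_llm(prompt: str) -> str:
    return "grass" if "grass" in prompt else "unknown"


answer = answer_with_dka(
    "What does the animal in the picture eat?",
    image=None,
    decompose=toy_decompose,
    caption_image=toy_caption,
    retrieve_knowledge=toy_retrieve,
    generate_answer=toy_llm,
)
print(answer)
```

Passing the models in as callables mirrors the modularity the paper emphasizes: the captioner or retriever can be swapped without touching the rest of the pipeline.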

Paper Structure (25 sections, 7 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Green lines: previous methods acquire knowledge in a coupled manner with only a forward process. Brown lines: our model disentangles knowledge acquisition by introducing LLM feedback.
  • Figure 2: The overall architecture of our model.
  • Figure 3: Heatmap visualization with GradCAM.
  • Figure 4: Examples with (w/) and without (w/o) disentanglement. $Q$, $C$, $K$, $A$, $q_{i}$, and $q_{k}$ represent the question, caption, retrieved knowledge, answer, image-based sub-question, and knowledge-based sub-question, respectively. Some important elements are marked in red.
  • Figure 5: Examples with disentanglement. $Q$, $C$, $K$, $A$, $q_{i}$, and $q_{k}$ represent the question, caption, retrieved knowledge, answer, image-based sub-question, and knowledge-based sub-question, respectively.