Knowledge Generation for Zero-shot Knowledge-based VQA

Rui Cao, Jing Jiang

TL;DR

This paper introduces KGenVQA, a zero-shot knowledge-based visual QA approach that explicitly generates relevant knowledge statements from an LLM and combines them with image captions and the question to produce an answer. Using a two-stage knowledge generation process and a self-supervised, cluster-based diversification strategy, KGenVQA produces multiple diverse knowledge inputs without additional training. Empirical results on OK-VQA and A-OKVQA show that generated knowledge improves accuracy across multiple QA backbones and can outperform several zero-shot baselines, with human evaluations confirming the quality of the generated knowledge. The work advances interpretability in K-VQA and points to future directions in filtering redundant knowledge and integrating with vision-language models for tighter multimodal reasoning.

Abstract

Previous solutions to knowledge-based visual question answering (K-VQA) retrieve knowledge from external knowledge bases and use supervised learning to train the K-VQA model. Recently, pre-trained LLMs have been used as both a knowledge source and a zero-shot QA model for K-VQA and have demonstrated promising results. However, these recent methods do not explicitly show the knowledge needed to answer the questions and thus lack interpretability. Inspired by recent work on knowledge generation from LLMs for text-based QA, in this work we propose and test a similar knowledge-generation-based K-VQA method, which first generates knowledge from an LLM and then incorporates the generated knowledge for K-VQA in a zero-shot manner. We evaluate our method on two K-VQA benchmarks and find that it performs better than previous zero-shot K-VQA methods and that the generated knowledge is generally relevant and helpful.

Paper Structure (32 sections, 2 figures, 15 tables)

This paper contains 32 sections, 2 figures, 15 tables.

Figures (2)

  • Figure 1: Three approaches to K-VQA: retrieve and answer, directly answer, and generate and answer.
  • Figure 2: An overview of the proposed method. We first convert the image into textual descriptions and prompt LLMs with the question and manual demonstrations to obtain the initial knowledge pieces. In the second stage, we diversify knowledge by selecting a diverse set of knowledge statements from the first stage as demonstrations. Lastly, we incorporate the generated knowledge for QA with a language model.
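
The pipeline described for Figure 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` argument is a hypothetical callable that maps a prompt string to one generated knowledge statement, the prompt template is invented for illustration, and the diversification step uses a simple greedy max-min selection over token-level Jaccard distance as a stand-in for the cluster-based strategy mentioned in the summary.

```python
from typing import Callable, List


def build_prompt(caption: str, question: str, demos: List[str]) -> str:
    """Assemble a knowledge-generation prompt (hypothetical template)."""
    demo_block = "\n".join(f"Knowledge: {d}" for d in demos)
    return f"{demo_block}\nContext: {caption}\nQuestion: {question}\nKnowledge:"


def jaccard_distance(a: str, b: str) -> float:
    """1 - |intersection| / |union| over lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def select_diverse(statements: List[str], k: int) -> List[str]:
    """Greedy max-min selection: a stand-in for cluster-based diversification."""
    unique = list(dict.fromkeys(statements))  # drop exact duplicates, keep order
    if not unique or k <= 0:
        return []
    chosen = [unique[0]]
    while len(chosen) < min(k, len(unique)):
        remaining = [s for s in unique if s not in chosen]
        # Pick the statement farthest from its nearest already-chosen neighbour.
        best = max(
            remaining,
            key=lambda s: min(jaccard_distance(s, c) for c in chosen),
        )
        chosen.append(best)
    return chosen


def generate_knowledge(
    caption: str,
    question: str,
    manual_demos: List[str],
    llm: Callable[[str], str],  # hypothetical: prompt -> one knowledge statement
    n_initial: int = 4,
    k: int = 2,
) -> List[str]:
    # Stage 1: prompt with manual demonstrations to get initial knowledge.
    prompt = build_prompt(caption, question, manual_demos)
    initial = [llm(prompt) for _ in range(n_initial)]
    # Stage 2: reuse a diverse subset of stage-1 statements as demonstrations.
    diverse = select_diverse(initial, k)
    final = list(diverse)
    for demo in diverse:
        final.append(llm(build_prompt(caption, question, [demo])))
    return list(dict.fromkeys(final))


def build_qa_input(caption: str, knowledge: List[str], question: str) -> str:
    """Concatenate caption, generated knowledge, and question for a text QA model."""
    return f"Context: {caption}\nKnowledge: {' '.join(knowledge)}\nQuestion: {question}"
```

In use, `llm` would wrap whichever text-generation model is available, and the string returned by `build_qa_input` would be passed to a separate text-based QA model. Because the generated statements are returned as plain text, they can be inspected directly, which is the interpretability benefit the abstract emphasizes.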