Knowledge Generation for Zero-shot Knowledge-based VQA
Rui Cao, Jing Jiang
TL;DR
This paper introduces KGenVQA, a zero-shot knowledge-based visual QA approach that explicitly generates relevant knowledge statements from an LLM and combines them with the image caption and the question to produce an answer. By employing a two-stage knowledge generation process and a self-supervised, cluster-based diversification strategy, KGenVQA produces multiple diverse knowledge inputs without additional training. Empirical results on OK-VQA and A-OKVQA show that generated knowledge improves accuracy across multiple QA backbones and can outperform several zero-shot baselines, with human evaluations confirming the quality of the generated knowledge. The work advances interpretability in K-VQA and points to two future directions: filtering redundant knowledge and integrating with vision-language models for tighter multimodal reasoning.
Abstract
Previous solutions to knowledge-based visual question answering (K-VQA) retrieve knowledge from external knowledge bases and use supervised learning to train the K-VQA model. Recently, pre-trained LLMs have been used as both a knowledge source and a zero-shot QA model for K-VQA, with promising results. However, these recent methods do not explicitly show the knowledge needed to answer the questions and thus lack interpretability. Inspired by recent work on knowledge generation from LLMs for text-based QA, we propose and test a similar knowledge-generation-based K-VQA method, which first generates knowledge from an LLM and then incorporates the generated knowledge for K-VQA in a zero-shot manner. We evaluate our method on two K-VQA benchmarks and find that it performs better than previous zero-shot K-VQA methods and that the generated knowledge is generally relevant and helpful.
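
To make the generate-then-answer idea concrete, the following is a minimal sketch of the second step only: picking a few non-redundant knowledge statements and placing them in a zero-shot prompt alongside the caption and question. It is an illustration under stated assumptions, not the paper's implementation. The function names (`select_diverse`, `build_prompt`), the prompt layout, and the thresholds are hypothetical, and a simple greedy token-overlap filter stands in for the cluster-based diversification, whose details are not given in this text. The knowledge statements are assumed to have already been produced by an LLM.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two statements."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_diverse(statements, k=3, max_sim=0.5):
    """Greedily keep up to k statements that are not near-duplicates.

    Assumption: this overlap filter is a stand-in for the cluster-based
    diversification mentioned above, not a reproduction of it.
    """
    kept = []
    for s in statements:
        if all(jaccard(s, t) <= max_sim for t in kept):
            kept.append(s)
        if len(kept) == k:
            break
    return kept


def build_prompt(caption, knowledge, question):
    """Assemble a zero-shot QA prompt (hypothetical layout)."""
    lines = [f"Context: {caption}"]
    lines += [f"Knowledge: {k}" for k in knowledge]
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Example: statements that an LLM might have generated for one image/question pair.
generated = [
    "Bananas are rich in potassium.",
    "Bananas are rich in potassium and fiber.",
    "Monkeys often eat fruit.",
]
kept = select_diverse(generated, k=2)
prompt = build_prompt(
    "A monkey holding a banana.",
    kept,
    "What mineral is this fruit known for?",
)
print(prompt)
```

The resulting prompt would then be passed to a text-only QA model. Because the selected knowledge statements appear verbatim in the prompt, they can be inspected directly, which is the interpretability benefit the abstract describes.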
