Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering

Zhou Yu, Xuecheng Ouyang, Zhenwei Shao, Meng Wang, Jun Yu

TL;DR

Prophet reframes knowledge-based VQA as a prompting task for LLMs by deriving two complementary heuristics (answer candidates and answer-aware examples) from a vanilla VQA model. These heuristics are encoded into a richer, task-specific prompt for a frozen LLM, so the LLM itself needs no fine-tuning. On four knowledge-based VQA datasets, Prophet with GPT-3 outperforms prior state-of-the-art methods, and its multimodal extension Prophet++ improves results further by integrating modern LMMs such as GPT-4o. The framework is general: it can be instantiated with different VQA models and different LLMs.

Abstract

Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and thus restricts the performance of their models. Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the "blind" LLM, as the provided textual input is insufficient to depict the visual information required to answer the question. In this paper, we present Prophet, a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the VQA model: answer candidates and answer-aware examples. The two types of answer heuristics are jointly encoded into a formatted prompt to facilitate the LLM's understanding of both the image and the question, thus generating a more accurate answer. By incorporating the state-of-the-art LLM GPT-3, Prophet significantly outperforms existing state-of-the-art methods on four challenging knowledge-based VQA datasets. Prophet is general and can be instantiated with combinations of different VQA models (both discriminative and generative) and different LLMs (both commercial and open-source). Moreover, Prophet can also be integrated with modern large multimodal models at different stages, a variant named Prophet++, to further improve performance on knowledge-based VQA tasks.
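
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions: the `Sample` class, the function names, the prompt field labels ("Context", "Question", "Candidates", "Answer"), and the confidence formatting are all hypothetical and are not taken from the paper. It assumes the VQA model exposes a score per answer, keeps the top-k answers as answer candidates, and places answer-aware examples before the test input as in-context examples.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Sample:
    """One prompt entry (hypothetical structure, for illustration only)."""
    caption: str                         # textual description of the image
    question: str
    candidates: List[Tuple[str, float]]  # (answer, confidence) from the VQA model
    answer: str = ""                     # ground truth; left empty for the test input


def top_k_candidates(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Keep the k highest-scoring answers as answer candidates."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def format_sample(s: Sample) -> str:
    """Render one sample; the field labels are assumptions, not the paper's format."""
    cands = ", ".join(f"{a} ({c:.2f})" for a, c in s.candidates)
    text = (
        f"Context: {s.caption}\n"
        f"Question: {s.question}\n"
        f"Candidates: {cands}\n"
        f"Answer: {s.answer}"
    )
    # With an empty answer the last line becomes "Answer:", which the LLM completes.
    return text.rstrip()


def build_prompt(examples: List[Sample], test: Sample, header: str) -> str:
    """Join the header, the in-context examples, and the test input."""
    blocks = [header] + [format_sample(e) for e in examples] + [format_sample(test)]
    return "\n\n".join(blocks)
```

In this sketch the resulting string would be sent to a frozen LLM, and the text it generates after the final "Answer:" would be taken as the prediction. How the in-context examples are selected is left out here; the sketch simply takes them as an input list.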

Paper Structure (22 sections, 13 equations, 7 figures, 17 tables)

Figures (7)

  • Figure 1: Conceptual comparisons of three knowledge-based VQA frameworks using a frozen LLM, e.g., GPT-3. While PICa, KAT, and REVIVE directly feed the caption (C) and question (Q) into the LLM as the prompt, we argue that the information they provide to the LLM is insufficient and thus cannot fully activate the LLM's potential. In contrast, our Prophet learns a vanilla VQA model without external knowledge to produce answer heuristics, which endow the LLM with richer and more task-specific information for answer prediction. Unlike counterparts that rely on specific VQA models and LLMs, Prophet is general and can be instantiated with combinations of different VQA models (both discriminative and generative) and different LLMs (both commercial and open-source). Moreover, Prophet can also be integrated with large multimodal models (LMMs) at different stages, a variant termed Prophet++, to further improve performance on knowledge-based VQA tasks.
  • Figure 2: Our Prophet framework has two stages: answer heuristics generation and heuristics-enhanced prompting. In the answer heuristics generation stage, a vanilla VQA model trained on a specific knowledge-based VQA dataset is employed to generate two types of complementary answer heuristics, i.e., answer candidates and answer-aware examples. In the heuristics-enhanced prompting stage, the answer heuristics, question, and caption are integrated into a formatted prompt to instruct a frozen LLM (e.g., GPT-3) to predict an answer. As shown in the example, both answer heuristics contribute to the answer "helium".
  • Figure 3: Discriminative vs. generative VQA models. Taking an image (V) and a question (Q) as inputs, a typical discriminative VQA model performs multi-class classification to predict the most relevant answer (which may contain multiple words) from a predefined answer vocabulary, while a typical generative VQA model iteratively predicts one answer word at a time to compose the final answer.
  • Figure 4: The Prophet++ framework additionally introduces large multimodal models (LMMs) at different stages of Prophet. Specifically, the LMM in stage 1 is used to generate a new type of answer heuristic, i.e., answer-aware rationales, while the LMM in stage 2 is used to handle an extra visual input (V), thus providing more comprehensive knowledge to answer the question.
  • Figure 5: Prophet's prediction behaviors in terms of (a) distribution and (b) per-type accuracy. As Prophet takes $K$ answer candidates from MCAN as inputs, we define three prediction behaviors of Prophet, namely "keep top 1", "in top 2-$K$", and "beyond top $K$" predictions of MCAN, respectively. Note that all the testing samples can be categorized into one of the three classes above.
  • ...and 2 more figures
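
The three prediction behaviors defined in the Figure 5 caption can be expressed as a small classification routine. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes answers are compared by exact string match and that the candidate list is ranked best first, neither of which is specified in the caption.

```python
from collections import Counter
from typing import Dict, List


def prediction_behavior(final_answer: str, candidates: List[str]) -> str:
    """Classify the final answer against the VQA model's K ranked candidates.

    `candidates` is assumed to be ordered from most to least confident.
    """
    if not candidates:
        raise ValueError("at least one answer candidate is required")
    if final_answer == candidates[0]:
        return "keep top 1"
    if final_answer in candidates[1:]:
        return "in top 2-K"
    return "beyond top K"


def behavior_distribution(
    final_answers: List[str], candidate_lists: List[List[str]]
) -> Dict[str, float]:
    """Fraction of test samples falling into each behavior class."""
    counts = Counter(
        prediction_behavior(ans, cands)
        for ans, cands in zip(final_answers, candidate_lists)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {behavior: n / total for behavior, n in counts.items()}
```

Because the three branches are exhaustive and mutually exclusive, every test sample lands in exactly one class, which matches the note in the caption. Per-type accuracy, as in panel (b), could then be computed by grouping samples by the returned label.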