Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model

Taehee Kim, Yeongjae Cho, Heejun Shin, Yohan Jo, Dongmyung Shin

TL;DR

CoQAH addresses the generalization gap from synthetic to human-written VQA by enabling a chain-of-QA between a domain-specialized VQA model and a large language model. The LLM generates template-based questions, probes a synthetic VQA model, and uses the dialogue to iteratively refine its answer, with an existence/uniqueness handler ensuring logical consistency. It achieves state-of-the-art performance on CLEVR-Human, VQA-RAD, and SLAKE without finetuning, outperforming general VLMs and template-based baselines and narrowing the gap to finetuned models. The approach also provides interpretable rationales for its answers, suggesting strong potential for scalable VQA in specialized domains where annotated human data are scarce.
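The interaction described above can be pictured as a simple loop. The following is a minimal sketch only, not the authors' implementation: the function names (`chain_of_qa`, `scripted_llm`, `toy_vqa`), the turn limit, and the hard-coded answers are all hypothetical stand-ins for a real LLM and a real VQA model trained on synthetic data.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (template question, VQA answer) pairs


def chain_of_qa(
    user_question: str,
    ask_llm: Callable[[str, History], Tuple[str, str]],
    ask_vqa: Callable[[str], str],
    max_turns: int = 5,  # hypothetical cap on the dialogue length
) -> Tuple[str, History]:
    """Alternate between the LLM and the VQA model until the LLM answers.

    ask_llm returns ("ask", template_question) or ("answer", final_answer).
    """
    history: History = []
    for _ in range(max_turns):
        action, text = ask_llm(user_question, history)
        if action == "answer":
            return text, history
        # The LLM asked a template-based question; forward it to the VQA model.
        history.append((text, ask_vqa(text)))
    return "unknown", history


# Hypothetical stand-in for a VQA model trained on template-based questions.
TOY_VQA_ANSWERS = {
    "How many cubes are there?": "3",
    "How many spheres are there?": "1",
}


def toy_vqa(question: str) -> str:
    return TOY_VQA_ANSWERS.get(question, "unknown")


# Hypothetical stand-in for the LLM: a fixed script instead of real reasoning.
def scripted_llm(user_question: str, history: History) -> Tuple[str, str]:
    if len(history) == 0:
        return "ask", "How many cubes are there?"
    if len(history) == 1:
        return "ask", "How many spheres are there?"
    cubes, spheres = int(history[0][1]), int(history[1][1])
    return "answer", "yes" if cubes > spheres else "no"


answer, dialogue = chain_of_qa(
    "Are there more cubes than spheres?", scripted_llm, toy_vqa
)
print(answer)
for question, reply in dialogue:
    print(f"Q: {question}  A: {reply}")
```

The recorded dialogue is what would serve as the rationale for the final answer: each step is a template-based question the VQA model can handle, while the comparison itself is left to the LLM.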

Abstract

Visual question answering (VQA) is a task in which an image is given and a series of questions is asked about it. Building an efficient VQA algorithm requires a large amount of QA data, which is very expensive. Generating synthetic QA pairs based on templates is a practical way to obtain data. However, VQA models trained on such data do not perform well on complex, human-written questions. To address this issue, we propose a new method called chain of QA for human-written questions (CoQAH). CoQAH uses a sequence of QA interactions between a large language model and a VQA model trained on synthetic data to reason and derive logical answers to human-written questions. We tested the effectiveness of CoQAH on two types of human-written VQA datasets, for 3D-rendered and chest X-ray images, and found that it achieved state-of-the-art accuracy on both types of data. Notably, CoQAH outperformed general vision-language models, VQA models, and medical foundation models with no finetuning.

Paper Structure (29 sections, 13 figures, 9 tables, 1 algorithm)

This paper contains 29 sections, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: (a) Human-written questions, compared to the fixed format of the template-based questions, include more complex free-form questions, such as ones requiring reasoning. (b), (c) Example cases where a template-based VQA model fails to answer human-written questions correctly.
  • Figure 1: Template format for the chest X-ray VQA dataset, MIMIC-Diff-VQA. A question is only composed of two parts, $\langle type \rangle$ and $\langle abnormality \rangle$. Similar to the case of the CLEVR dataset, we asked an LLM to select an option available for each $\langle type \rangle$ and $\langle abnormality \rangle$ to generate a template-based question.
  • Figure 2: An overview of the proposed CoQAH method. (a) An example of a task instruction is shown at the top. The figure on the left describes an overall interaction process between an LLM and a template-based VQA model to reach the final answer to the user question. On the right, the figure represents an example of dialogue between the two models. (b) The template format for the questions in the CLEVR dataset is described, which is composed of several different entities ($\langle question \rangle = \langle type \rangle + \langle object \rangle + \langle relation \rangle + \langle object \rangle$). For each entity, a few options are available to be selected (e.g., small or large or $\langle Empty \rangle$ for $\langle Size \rangle$ entity).
  • Figure 2: Task instruction of CoQAH for CLEVR.
  • Figure 3: Illustration of how the existence and uniqueness handler (EUH) can prevent an LLM from concluding an incorrect answer for a given user question. (a) An example case includes a question, an answer, and an image. (b) A dialogue between an LLM and a VQA model when the existence of an object is violated. With EUH, the VQA model checks the presence of an object and lets LLM know its existence. (c) A dialogue when the uniqueness of an object is violated. With EUH, the VQA model successfully lets LLM know the number of objects satisfying the same condition.
  • ...and 8 more figures
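
The existence and uniqueness handling described in the Figure 3 caption can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's EUH: a hand-written scene list stands in for the VQA model, and the names `TOY_SCENE`, `count_matches`, and `query_attribute`, as well as the wording of the returned messages, are hypothetical.

```python
from typing import Dict, List

Scene = List[Dict[str, str]]

# Hypothetical scene standing in for what a VQA model would perceive.
TOY_SCENE: Scene = [
    {"size": "large", "color": "red", "material": "metal", "shape": "cube"},
    {"size": "small", "color": "red", "material": "rubber", "shape": "cube"},
    {"size": "small", "color": "blue", "material": "rubber", "shape": "sphere"},
]


def describe(obj_filter: Dict[str, str]) -> str:
    """Render the filter in a fixed size/color/material/shape order."""
    order = ("size", "color", "material", "shape")
    return " ".join(obj_filter[k] for k in order if k in obj_filter)


def matches(obj: Dict[str, str], obj_filter: Dict[str, str]) -> bool:
    return all(obj[k] == v for k, v in obj_filter.items())


def count_matches(scene: Scene, obj_filter: Dict[str, str]) -> int:
    return sum(matches(obj, obj_filter) for obj in scene)


def query_attribute(scene: Scene, obj_filter: Dict[str, str], attribute: str) -> str:
    """Answer an attribute question only if the referenced object exists
    and is unique; otherwise report the violation back to the asker."""
    n = count_matches(scene, obj_filter)
    name = describe(obj_filter)
    if n == 0:
        # Existence violated: tell the LLM the object is absent.
        return f"There is no {name} in the image."
    if n > 1:
        # Uniqueness violated: tell the LLM how many objects match.
        return f"There are {n} objects matching '{name}'."
    target = next(obj for obj in scene if matches(obj, obj_filter))
    return target[attribute]


print(query_attribute(TOY_SCENE, {"color": "green", "shape": "cube"}, "material"))
print(query_attribute(TOY_SCENE, {"color": "red", "shape": "cube"}, "size"))
print(query_attribute(TOY_SCENE, {"color": "blue", "shape": "sphere"}, "material"))
```

Without the two checks, the first two queries would still produce some attribute value, which is the kind of unsupported answer the caption says could lead the LLM to an incorrect conclusion.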