From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, Steven C. H. Hoi

TL;DR

The paper tackles zero-shot VQA by eliminating end-to-end vision-language fine-tuning and instead leveraging frozen large language models guided by image-derived prompts. It introduces Img2LLM, a plug-and-play pipeline that converts image content into synthetic question-answer exemplars and question-relevant captions to prompt LLMs in-context, in a way that bridges both modality and task gaps. Empirical results show state-of-the-art zero-shot performance on VQAv2, OK-VQA, and A-OKVQA among frozen-LM methods, with clear scaling benefits as LLM size increases. Extensive ablations reveal the impact of question generation, caption selection, and prompt design on performance, while also noting computational overhead as a limitation. Overall, the method provides a flexible, low-cost path to deploy powerful VQA capabilities with upgradable LLMs without end-to-end training.

Abstract

Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between the LLM and the VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose Img2Prompt, a plug-and-play module that provides prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide the LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2) Without the need for end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo \cite{Deepmind:Flamingo2022} by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20%.
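
The prompting idea described above can be illustrated with a minimal sketch. It assumes the captions and the synthetic question-answer pairs have already been produced by separate models, and it only shows how they might be assembled into a single text prompt for a frozen LLM. The function name `build_vqa_prompt`, the instruction wording, and the "Contexts / Question / Answer" layout are illustrative assumptions, not the authors' exact template.

```python
from typing import List, Tuple


def build_vqa_prompt(
    captions: List[str],
    exemplars: List[Tuple[str, str]],
    question: str,
    instruction: str = "Please answer the question according to the contexts.",
) -> str:
    """Assemble a text-only prompt (hypothetical layout, not the paper's exact one)."""
    lines = [instruction]
    # Caption prompt: image content expressed as text, addressing the modality gap.
    lines.append("Contexts: " + " ".join(c.strip() for c in captions))
    # Exemplar prompt: synthetic QA pairs used as in-context demonstrations,
    # addressing the task gap.
    for q, a in exemplars:
        lines.append(f"Question: {q.strip()} Answer: {a.strip()}")
    # The target question is left open for the frozen LLM to complete.
    lines.append(f"Question: {question.strip()} Answer:")
    return "\n".join(lines)


prompt = build_vqa_prompt(
    captions=["A man rides a wave on a surfboard.", "The ocean is blue."],
    exemplars=[("What is the man riding?", "surfboard")],
    question="What sport is this?",
)
print(prompt)
```

Because the prompt is plain text, the same string could be passed to any LLM without retraining, which is the flexibility the abstract emphasizes.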

Paper Structure (28 sections, 2 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: An illustrative comparison of three types of methods that enable LLMs to perform VQA tasks, where a blue block denotes that the inner parameters are frozen, while a pink block indicates that the inner parameters are trainable.
  • Figure 2: The overall pipeline of Img2LLM, including Caption Prompt and Exemplar Prompt generation.
  • Figure 3: Example predictions made by Img2LLM. Specifically, (a) and (b) are successful cases, while (c) and (d) are failure cases. See more examples in Appendix A.5.
  • Figure 4: Success case analysis for OK-VQA. Green color indicates answer cues and correct prediction.
  • Figure 5: Failure case analysis for OK-VQA. Red color indicates incorrect prediction.
  • ...and 4 more figures