From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, Steven C. H. Hoi
TL;DR
The paper tackles zero-shot VQA by eliminating end-to-end vision-language fine-tuning and instead leveraging frozen large language models guided by image-derived prompts. It introduces Img2LLM, a plug-and-play pipeline that converts image content into synthetic question-answer exemplars and question-relevant captions, which are used to prompt LLMs in-context and so bridge both the modality gap and the task gap. Empirical results show state-of-the-art zero-shot performance on VQAv2, OK-VQA, and A-OKVQA among frozen-LM methods, with clear scaling benefits as LLM size increases. Extensive ablations reveal the impact of question generation, caption selection, and prompt design on performance, and the paper notes computational overhead as a limitation. Overall, the method provides a flexible, low-cost path to deploying VQA capabilities with upgradable LLMs and no end-to-end training.
Abstract
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, using LLMs effectively for zero-shot visual question answering (VQA) remains challenging, primarily because of the modality disconnection and the task disconnection between the LLM and the VQA task. End-to-end training on vision and language data may bridge these disconnections, but it is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides prompts bridging the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA without end-to-end training. To produce such prompts, we employ LLM-agnostic models that describe the image content and generate self-constructed question-answer pairs, which effectively guide the LLM to perform zero-shot VQA. Img2Prompt offers the following benefits: 1) It works flexibly with various LLMs to perform VQA. 2) Because it needs no end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA. 3) It achieves comparable or better performance than methods that rely on end-to-end training. For example, we outperform Flamingo \cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
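The prompt construction described above can be illustrated with a minimal sketch. It assumes that captions and synthetic question-answer pairs have already been produced by separate LLM-agnostic models, and it only shows how they might be joined into one text prompt for a frozen LLM. The function name `build_vqa_prompt`, the `Contexts:` / `Question:` / `Answer:` template, and the sample captions and pairs are all hypothetical choices made for illustration; this section does not specify the exact template the authors use.

```python
from typing import List, Tuple


def build_vqa_prompt(
    captions: List[str],
    qa_pairs: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble a text-only prompt for a frozen LLM (hypothetical template)."""
    # Captions stand in for the image, which addresses the modality gap.
    context = "Contexts: " + " ".join(c.strip() for c in captions)
    # Synthetic question-answer pairs serve as in-context exemplars,
    # which addresses the task gap.
    exemplars = [
        f"Question: {q.strip()} Answer: {a.strip()}" for q, a in qa_pairs
    ]
    # The user's question goes last, with the answer left open
    # for the LLM to complete.
    query = f"Question: {question.strip()} Answer:"
    return "\n".join([context, *exemplars, query])


# Made-up captions and QA pairs, for illustration only.
prompt = build_vqa_prompt(
    captions=["A brown dog runs on a beach.", "A dog chases a red ball."],
    qa_pairs=[
        ("What animal is running?", "dog"),
        ("What color is the ball?", "red"),
    ],
    question="Where is the dog?",
)
print(prompt)
```

Because the prompt is plain text, the same string could be passed to any LLM that supports text completion, which is what makes the approach plug-and-play with respect to the language model.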
