Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches

Igor Sterner; Weizhe Lin; Jinghong Chen; Bill Byrne

TL;DR

The paper examines how to feed visual information into frozen LLMs for visual question answering (VQA) by comparing two approaches: embedding-based VQA, which maps image features directly into the LLM embedding space, and caption-based VQA, which generates a caption for each image and prompts the LLM with that text. The image feature space is held fixed: features from a CLIP ViT-G encoder are bridged by a non-linear mapping to a frozen Flan-T5 XL (3B parameters). The authors evaluate both approaches in zero-shot and few-shot regimes with systematic in-context example selection. Caption-based VQA performs better in the zero-shot regime, while few-shot performance is highly sensitive to how the in-context examples are selected. Under certain selection strategies and data configurations, embedding-based VQA outperforms caption-based VQA, especially on color and counting questions. These results underline the importance of reporting caption-based baselines in multimodal VQA and offer guidance for building fair benchmarks and future multimodal LLM systems.

Abstract

Two approaches have emerged for inputting images into large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. Most recent few-shot multimodal work reports performance using architectures that employ a variation of one of these two approaches, but overlooks an important comparison between them. We design a controlled and focused experiment to compare the two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B-parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find that using textual image captions is better. In the few-shot regime, how the in-context examples are selected determines which approach is better.
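To make the caption-based setup concrete, the following is a minimal sketch of how a few-shot prompt could be assembled from captions. The `Example` class, the `build_caption_prompt` function, and the "Context / Question / Answer" template are all hypothetical choices made for illustration; the abstract does not specify the authors' prompt format. In the embedding-based approach, mapped image embeddings would be passed to the LLM in place of the caption text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    # Hypothetical container; field names are assumptions, not from the paper.
    caption: str      # text description of the image, from any captioning model
    question: str
    answer: str = ""  # left empty for the query to be answered


def build_caption_prompt(shots: List[Example], query: Example) -> str:
    """Format the in-context examples, then the query, as one text prompt."""
    blocks = [
        f"Context: {ex.caption}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        for ex in shots
    ]
    # The query block ends with an open "Answer:" for the LLM to complete.
    blocks.append(
        f"Context: {query.caption}\nQuestion: {query.question}\nAnswer:"
    )
    return "\n\n".join(blocks)


if __name__ == "__main__":
    shots = [Example("two dogs on a beach", "How many dogs are there?", "2")]
    query = Example("a red bus on a street", "What color is the bus?")
    print(build_caption_prompt(shots, query))
```

Passing an empty `shots` list gives the zero-shot prompt. Which examples go into `shots` is the in-context selection step that, according to the abstract, determines which approach performs better in the few-shot regime.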
