OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description

Quanxing Xu, Ling Zhou, Feifei Zhang, Jinyu Tian, Rubing Huang

TL;DR

This work tackles language bias and poor domain generalization in large-language-model–based VQA by introducing OAD-Promoter, a zero-shot framework that uses multi-granularity object-attribute captions and memory-augmented knowledge. It comprises three components: OEG to enrich visual input with global and object-focused descriptions, MKA to retrieve memory-based support for unseen domains, and the OAD Prompt to coherently fuse these signals for LLM inference. The approach demonstrates strong zero-shot and domain-shift performance across multiple benchmarks, achieving state-of-the-art results on VQAv2 and showing robust generalization to distribution shifts (e.g., VQA-CP, GQA-OOD) while remaining compatible with diverse LLMs. Overall, OAD-Promoter offers a practical, data-free zero-shot pathway to more reliable and domain-adaptive VQA with LLMs.

Abstract

Large Language Models (LLMs) have become a crucial tool in Visual Question Answering (VQA) for handling knowledge-intensive questions in few-shot or zero-shot scenarios. However, their reliance on massive training datasets often causes them to inherit language biases during the acquisition of knowledge. This limitation imposes two key constraints on existing methods: (1) LLM predictions become less reliable due to bias exploitation, and (2) despite strong knowledge reasoning capabilities, LLMs still struggle with out-of-distribution (OOD) generalization. To address these issues, we propose Object Attribute Description Promoter (OAD-Promoter), a novel approach for enhancing LLM-based VQA by mitigating language bias and improving domain-shift robustness. OAD-Promoter comprises three components: the Object-concentrated Example Generation (OEG) module, the Memory Knowledge Assistance (MKA) module, and the OAD Prompt. The OEG module generates global captions and object-concentrated samples, jointly enhancing visual information input to the LLM and mitigating bias through complementary global and regional visual cues. The MKA module assists the LLM in handling OOD samples by retrieving relevant knowledge from stored examples to support questions from unseen domains. Finally, the OAD Prompt integrates the outputs of the preceding modules to optimize LLM inference. Experiments demonstrate that OAD-Promoter significantly improves the performance of LLM-based VQA methods in few-shot or zero-shot settings, achieving new state-of-the-art results.
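The retrieval and prompt-fusion steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure, the memory layout, or the prompt wording, so `StoredExample`, `token_overlap`, `retrieve`, and `build_prompt` are hypothetical names, and simple token overlap stands in for whatever retrieval metric the MKA module actually uses.

```python
from dataclasses import dataclass


@dataclass
class StoredExample:
    """One hypothetical memory entry: a stored question with its context and answer."""
    question: str
    context: str
    answer: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (a stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(memory, question, k=2):
    """Return the k stored examples whose questions best match the query (MKA-like step)."""
    ranked = sorted(
        memory,
        key=lambda ex: token_overlap(ex.question, question),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(global_caption, object_captions, retrieved, question):
    """Fuse global caption, object captions, and retrieved examples into one prompt string."""
    # The wording below is assumed; the paper's actual OAD Prompt template is not shown here.
    lines = ["Answer the question using the image descriptions."]
    lines.append(f"Global caption: {global_caption}")
    for i, cap in enumerate(object_captions, start=1):
        lines.append(f"Object {i}: {cap}")
    for ex in retrieved:
        lines.append(f"Example context: {ex.context}")
        lines.append(f"Example question: {ex.question}")
        lines.append(f"Example answer: {ex.answer}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The resulting string would then be passed to whichever LLM is in use. Keeping retrieval and prompt assembly as separate functions mirrors the split between the MKA module and the OAD Prompt in the description.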

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Illustration of the problem in existing LLM-based KBVQA. Like conventional VQA models, LLMs tend to exploit their internal language bias during inference. This drawback hampers both their accuracy and their domain adaptation capabilities.
  • Figure 2: Architecture of OAD-Promoter. It comprises three components: 1) The OEG module (green box) generates a global caption and object-focused samples; 2) The MKA module (blue box) processes novel inputs by leveraging relevant stored examples to assist the LLM; and 3) The OAD Prompt (red box) integrates outputs from the preceding modules and directs the LLM to produce the final answer.
  • Figure 3: Detailed process of the OEG module. BLIP2 and VinVL produce the global caption and the object-concentrated captions, respectively. The generated questions are produced by a pre-trained T5-large model via prompting.
  • Figure 4: Qualitative analysis of the proposed method. Four cases from distinct domains are displayed.
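The OEG data flow outlined in the Figure 3 caption might be organized as in the sketch below. It is a structural outline only: the three models are passed in as plain callables rather than loaded, `OEGOutput` and `run_oeg` are hypothetical names, and generating one question per object caption is an assumption, since the caption does not say what the question generator takes as input.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class OEGOutput:
    """Hypothetical container for what the OEG module hands to the prompt stage."""
    global_caption: str
    object_captions: List[str]
    generated_questions: List[str]


def run_oeg(
    image: Any,
    global_captioner: Callable[[Any], str],
    object_captioner: Callable[[Any], List[str]],
    question_generator: Callable[[str], str],
) -> OEGOutput:
    """Run the three OEG stages in order using caller-supplied models."""
    # Stage 1: one caption for the whole image (BLIP2 in the paper).
    global_caption = global_captioner(image)
    # Stage 2: one caption per detected object or region (VinVL in the paper).
    object_captions = object_captioner(image)
    # Stage 3: a question per object caption (T5-large in the paper);
    # pairing questions with object captions is an assumption of this sketch.
    questions = [question_generator(cap) for cap in object_captions]
    return OEGOutput(global_caption, object_captions, questions)
```

Injecting the models as callables keeps the sketch runnable with simple stand-ins, for example `run_oeg(img, lambda im: "a dog on grass", lambda im: ["brown dog"], lambda c: f"What is the {c} doing?")`.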