SADL: An Effective In-Context Learning Method for Compositional Visual QA

Long Hoang Dang, Thao Minh Le, Vuong Le, Tu Minh Phuong, Truyen Tran

TL;DR

Large vision-language models enable in-context learning for Visual QA from a few demonstrations, but how to design prompts for compositional reasoning remains poorly understood. We introduce SADL, a training-free prompting framework with three components (Sampling, Deliberation, and Pseudo-Labeling): it selects query-specific demonstrations, decomposes complex questions into subquestions via an external LLM, and iteratively labels the subquestions to guide the LVLM with $k=2$ demonstrations. Across GQA, GQA-OOD, CLEVR, and CRIC, SADL outperforms vanilla prompts, CoT, and L2M, with especially strong gains on out-of-distribution and highly compositional questions; ablations confirm the critical roles of neighborhood sampling, decomposition, and precise subquestion-label alignment. The results provide practical insights for vision-language in-context learning and offer a scalable, inference-only approach to improving compositional Visual QA.

Abstract

Large vision-language models (LVLMs) offer a novel capability for performing in-context learning (ICL) in Visual QA. When prompted with a few demonstrations of image-question-answer triplets, LVLMs have demonstrated the ability to discern underlying patterns and transfer this latent knowledge to answer new questions about unseen images without the need for expensive supervised fine-tuning. However, designing effective vision-language prompts, especially for compositional questions, remains poorly understood. Adapting language-only ICL techniques may not necessarily work because we need to bridge the visual-linguistic semantic gap: Symbolic concepts must be grounded in visual content, which does not share the syntactic linguistic structures. This paper introduces SADL, a new visual-linguistic prompting framework for the task. SADL revolves around three key components: SAmpling, Deliberation, and Pseudo-Labeling of image-question pairs. Given an image-question query, we sample image-question pairs from the training data that are in semantic proximity to the query. To address the compositional nature of questions, the deliberation step decomposes complex questions into a sequence of subquestions. Finally, the sequence is progressively annotated one subquestion at a time to generate a sequence of pseudo-labels. We investigate the behaviors of SADL under OpenFlamingo on large-scale Visual QA datasets, namely GQA, GQA-OOD, CLEVR, and CRIC. The evaluation demonstrates the critical roles of sampling in the neighborhood of the image, the decomposition of complex questions, and the accurate pairing of the subquestions and labels. These findings do not always align with those found in language-only ICL, suggesting fresh insights in vision-language settings.
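
The three steps described above (sampling, deliberation, pseudo-labeling) can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: the `Example` type, the `sample` / `decompose` / `answer` callables, and the toy stand-ins below are hypothetical names introduced here. In practice the decomposer would wrap an LLM and the answerer would wrap an LVLM; which model produces the pseudo-labels is left open.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Example:
    image: str     # placeholder for an image handle (path, id, tensor, ...)
    question: str


# Hypothetical interfaces (assumptions for illustration, not from the paper).
Sampler = Callable[[Example, Sequence[Example], int], List[Example]]
Decomposer = Callable[[str], List[str]]  # question -> ordered subquestions
Answerer = Callable[[str, str, List[Tuple[str, str]]], str]


def pseudo_label(ex: Example, decompose: Decomposer,
                 answer: Answerer) -> List[Tuple[str, str]]:
    """Annotate subquestions one at a time; earlier pairs are passed as context."""
    labeled: List[Tuple[str, str]] = []
    for sub_q in decompose(ex.question):
        labeled.append((sub_q, answer(ex.image, sub_q, list(labeled))))
    return labeled


def build_context(query: Example, pool: Sequence[Example], sample: Sampler,
                  decompose: Decomposer, answer: Answerer, k: int = 2):
    """Pick k demonstrations near the query and pseudo-label each of them."""
    demos = sample(query, pool, k)
    return [(d, pseudo_label(d, decompose, answer)) for d in demos]


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_sample(query: Example, pool: Sequence[Example], k: int) -> List[Example]:
    # Word overlap as a crude proxy for semantic proximity.
    q_words = set(query.question.lower().split())
    overlap = lambda e: len(q_words & set(e.question.lower().split()))
    return sorted(pool, key=overlap, reverse=True)[:k]


def toy_decompose(question: str) -> List[str]:
    # Split on " and " as a crude proxy for LLM-based decomposition.
    return [p.strip() + "?" for p in question.rstrip("?").split(" and ")]


def toy_answer(image: str, sub_q: str, context: List[Tuple[str, str]]) -> str:
    # A real answerer would query a model with the image and prior context.
    return f"answer{len(context) + 1}"
```

The resulting list of (demonstration, subquestion-label pairs) would then be serialized into the prompt ahead of the query; that serialization depends on the model's prompt format and is not shown here.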

Paper Structure

This paper contains 24 sections, 3 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: We study prompting techniques for vision-language reasoning tasks to bridge the gap between compositional reasoning and LVLM inference-free reasoning.
  • Figure 2: Overview of the SADL framework. Building on in-context learning for VQA, we introduce three new components (dark green boxes) for effective compositional reasoning with an LVLM. Proximal Sampling uses L-V similarity to select query-specific samples. For each sample, a prioritized list of sub-questions is constructed by Question Decomposition and used to guide compositional reasoning in the Pseudo-Labeling step, providing effective context for answering the query.
  • Figure 3: Example of our prompt to instruct the Vicuna 13B model to decompose the complex compositional question.
  • Figure 4: Example of vanilla few-shot prompting (two-shot setting).
  • Figure 5: Example of chain-of-thought prompting (two-shot setting).
  • ...and 3 more figures
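
The Figure 2 caption mentions that Proximal Sampling selects demonstrations by L-V similarity. A minimal sketch of one way this could work is given below, under several assumptions that are not in the source: "L-V" is read as language-vision, each example is assumed to come with precomputed question and image embeddings, and the two cosine similarities are combined by a weighted sum with a hypothetical weight `alpha`. The paper's actual scoring rule may differ.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def proximal_sample(query_q: Vector, query_img: Vector,
                    pool: Sequence[Tuple[Vector, Vector]],
                    k: int = 2, alpha: float = 0.5) -> List[int]:
    """Return indices of the k pool items closest to the query.

    pool holds (question_embedding, image_embedding) pairs. The weighted-sum
    score and `alpha` are assumptions made for illustration only.
    """
    scores = [
        alpha * cosine(query_q, q_emb) + (1 - alpha) * cosine(query_img, img_emb)
        for q_emb, img_emb in pool
    ]
    return sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)[:k]
```

Setting `alpha=1.0` or `alpha=0.0` reduces this to question-only or image-only retrieval, which is a convenient way to compare the two signals in isolation.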