Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

TL;DR

RepARe addresses underspecified questions in visual question answering (VQA) by generating multiple visually grounded variants of a question and selecting the most answerable one using the LVLM's own confidence as an unsupervised score. The approach is a two-stage pipeline: rephrasing and augmentation, which draws on salient entities, rationales, and image captions, followed by confidence-based selection. It yields consistent zero-shot gains on VQAv2, A-OKVQA, and VizWiz across multiple LVLMs. Ablation analyses show that rationales, captions, and question entities are all critical, and that grounding the question in visual information improves both answerability and the LVLM's ability to leverage its LLM component. While paraphrasing alone can offer strong gains in oracle settings, RepARe performs better at inference time by combining visual grounding with targeted question modification, illustrating the value of grounding over purely linguistic rewrites. The method offers a practical, gradient-free path to improving zero-shot VQA, though it incurs additional computational cost and depends on dataset quality and carefully designed prompts to avoid introducing biases.

Abstract

An increasing number of vision-language tasks can be handled with little to no training, i.e., in a zero- and few-shot manner, by marrying large language models (LLMs) to vision encoders, resulting in large vision-language models (LVLMs). While this has huge upsides, such as not requiring training data or custom architectures, how an input is presented to an LVLM can have a major impact on zero-shot model performance. In particular, inputs phrased in an underspecified way can result in incorrect answers due to factors like missing visual information, complex implicit reasoning, or linguistic ambiguity. Therefore, adding visually grounded information to the input as a preemptive clarification should improve model performance by reducing underspecification, e.g., by localizing objects and disambiguating references. Similarly, in the VQA setting, changing the way questions are framed can make them easier for models to answer. To this end, we present Rephrase, Augment and Reason (RepARe), a gradient-free framework that extracts salient details about the image using the underlying LVLM as a captioner and reasoner, in order to propose modifications to the original question. We then use the LVLM's confidence over a generated answer as an unsupervised scoring function to select the rephrased question most likely to improve zero-shot performance. Focusing on three visual question answering tasks, we show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively. Additionally, we find that using gold answers for oracle question candidate selection achieves a substantial gain in VQA accuracy of up to 14.41%. Through extensive analysis, we demonstrate that outputs from RepARe increase syntactic complexity and effectively utilize vision-language interaction and the frozen LLM.
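The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `AnswerFn` interface, the `toy_lvlm` stand-in, the example questions, and the choice to define confidence as the product of answer-token probabilities are all assumptions made for illustration. A real LVLM call would also take the image as input.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not from the paper): given a question,
# return the generated answer and the log-probability of each answer token.
AnswerFn = Callable[[str], Tuple[str, Sequence[float]]]


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed confidence: probability of the whole answer, i.e. the product
    of its token probabilities (computed as exp of the summed log-probs)."""
    return math.exp(sum(token_logprobs))


def select_question(candidates: List[str], answer_fn: AnswerFn) -> Tuple[str, str, float]:
    """Return (question, answer, confidence) for the candidate whose generated
    answer the model is most confident in. No gold labels are used."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    best = None
    for question in candidates:
        answer, logprobs = answer_fn(question)
        score = answer_confidence(logprobs)
        if best is None or score > best[2]:
            best = (question, answer, score)
    return best


# Toy stand-in for an LVLM with made-up answers and probabilities.
TOY_TABLE = {
    "What is behind the boy?": ("tree", [math.log(0.4)]),
    "What is behind the boy standing on the sidewalk?": ("bus", [math.log(0.9)]),
    "What vehicle is behind the boy?": ("bus", [math.log(0.7)]),
}


def toy_lvlm(question: str) -> Tuple[str, Sequence[float]]:
    return TOY_TABLE[question]


if __name__ == "__main__":
    question, answer, score = select_question(list(TOY_TABLE), toy_lvlm)
    print(question, answer, round(score, 3))
```

Because the score depends only on the model's own output probabilities, this kind of selection needs no gradients or labeled data, which matches the gradient-free, unsupervised framing above. Longer answers accumulate lower products, so a length-normalized score is a plausible alternative.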

Paper Structure (44 sections, 2 equations, 4 figures, 9 tables)


Figures (4)

  • Figure 1: Top: The original question (in A-OKVQA) lacks information needed for implicit reasoning, leading to an incorrect answer. RepARe interacts with the LVLM to extract attributes like "tennis players" and "position w.r.t. net" that are key to answering the question correctly. Adding these modifiers to the question elicits the correct response from the LVLM. Bottom: Underspecified questions from the A-OKVQA (left) and VQAv2 (right) datasets along with RepARe outputs.
  • Figure 2: Schematic of RepARe for an image requiring implicit reasoning from A-OKVQA. We first extract keywords, captions, and rationales from the image conditioned on the question, which are used to identify important objects (e.g., day and clock). We query an LVLM about these objects to collect visual details in I(a), which are fused into the original question to produce, in this case, $n=3$ candidates (I(b)). Lastly, we score and select from the candidates using the LVLM's answer confidence (II).
  • Figure 3: Example images and original questions for the qualitative examples table. Some questions (e.g., "What is behind the boy") are underspecified, while others refer to small objects in the image (e.g., "What color wetsuit is he wearing?").
  • Figure 4: Trends in VQA performance of RepARe for different values of $n$.