AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

Jihyoung Jang, Hyounghun Kim

TL;DR

This paper introduces Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity and pairs each level with its optimal response strategy.

Abstract

Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.

Paper Structure (28 sections, 1 equation, 13 figures, 8 tables)

This paper contains 28 sections, 1 equation, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Examples of model responses to an ambiguous visual question. In this image, none of the bats is visually salient, making the visual context ambiguous. GPT, Gemini, and Qwen answer by arbitrarily selecting one bat (e.g., the bat in the foreground) despite the ambiguity, whereas our model, which is trained to handle such cases strategically, requests clarification instead.
  • Figure 2: Examples of the four ambiguity levels in AQuA.
  • Figure 3: Reward assignment process. Since the given image contains multiple vehicles, the correct response is to request clarification. A perfectly accurate clarification receives a reward of 1. If the response requests clarification but contains a factual error, a penalty of 0.3 is applied. All other response types are assigned a reward of 0.
  • Figure 4: Response comparison of Qwen2.5-VL-3B-Instruct in zero-shot and tuned settings.
  • Figure 5: Confusion matrices of the response patterns of Qwen2.5-VL-3B-Instruct on AQuA.
  • ...and 8 more figures
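
The reward assignment described in the Figure 3 caption can be sketched roughly as below. This is an illustrative reading of the caption, not the authors' implementation. The names `Strategy`, `FACTUAL_ERROR_PENALTY`, and `reward` are hypothetical, the strategy labels are taken from the abstract, and two points are assumptions: that the 0.3 penalty is subtracted from the full reward of 1, and that the same rule applies when the target strategy is something other than requesting clarification.

```python
from enum import Enum


class Strategy(Enum):
    """Response strategies named in the abstract (labels are hypothetical)."""
    DIRECT_ANSWER = "direct_answer"
    INFER_INTENT = "infer_intent"
    LIST_ALTERNATIVES = "list_alternatives"
    REQUEST_CLARIFICATION = "request_clarification"


# Assumption: the caption's "0.3 penalty" is subtracted from the full reward of 1.
FACTUAL_ERROR_PENALTY = 0.3


def reward(predicted: Strategy, target: Strategy, has_factual_error: bool) -> float:
    """Return a scalar reward for one response.

    - wrong strategy: 0
    - right strategy, factually accurate: 1
    - right strategy, factual error: 1 minus the penalty (assumed)
    """
    if predicted is not target:
        return 0.0
    if has_factual_error:
        return 1.0 - FACTUAL_ERROR_PENALTY
    return 1.0


if __name__ == "__main__":
    # Figure 3 scenario: multiple vehicles, so the target is a clarification request.
    target = Strategy.REQUEST_CLARIFICATION
    print(reward(Strategy.REQUEST_CLARIFICATION, target, has_factual_error=False))
    print(reward(Strategy.REQUEST_CLARIFICATION, target, has_factual_error=True))
    print(reward(Strategy.DIRECT_ANSWER, target, has_factual_error=False))
```

How the strategy of a generated response and the presence of a factual error are detected is not covered by the caption, so both are passed in here as plain arguments.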