CommVQA: Situating Visual Question Answering in Communicative Contexts

Nandita Shankar Naik, Christopher Potts, Elisa Kreiss

TL;DR

This work introduces CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios in which the image might appear, and follow-up questions and answers conditioned on the scenario and description.

Abstract

Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask depend on their informational needs and prior knowledge about the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios where the image might appear (e.g., a travel website), and follow-up questions and answers conditioned on the scenario and description. CommVQA, which contains 1,000 images and 8,949 question-answer pairs, poses a challenge for current models. Error analyses and a human-subjects study suggest that generated answers still contain high rates of hallucinations, fail to fittingly address unanswerable questions, and don't suitably reflect contextual information. Overall, we show that access to contextual information is essential for solving CommVQA, leading to the highest-performing VQA model and highlighting the relevance of situating systems within communicative scenarios.
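The dataset composition described above can be pictured as one record per question. The following is a minimal sketch only: the class name `CommVQAEntry`, its field names, and the prompt layout in `build_prompt` are hypothetical illustrations and are not taken from the paper or its released data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommVQAEntry:
    """Hypothetical record layout; field names are illustrative assumptions."""
    image_path: str   # location of the image file
    scenario: str     # communicative scenario, e.g. "travel"
    description: str  # image description shown to the question asker
    question: str     # follow-up question conditioned on scenario and description
    answers: List[str] = field(default_factory=list)  # human-written answers


def build_prompt(entry: CommVQAEntry, use_context: bool) -> str:
    """Assemble a text prompt, optionally including scenario and description.

    The prompt wording is an assumption made for illustration only.
    """
    parts = []
    if use_context:
        parts.append(f"Scenario: {entry.scenario}")
        parts.append(f"Description: {entry.description}")
    parts.append(f"Question: {entry.question}")
    parts.append("Answer:")
    return "\n".join(parts)
```

Calling `build_prompt(entry, use_context=False)` yields a question-only prompt, while `use_context=True` prepends the scenario and description, which mirrors the contrast between evaluating image-question pairs in isolation and evaluating them within a communicative context.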

Paper Structure (37 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Overview of the CommVQA Dataset Construction. Images were sourced from Wikipedia and paired with relevant scenarios. The descriptions were first generated by GPT-4V, then edited by humans. Other participants then provided questions and answers based on the scenario and description, resulting in at least three answers for each of the 2,983 unique visual questions. Simplified instructions are shown here; full details are in the Appendix.
  • Figure 2: Heatmap of BERT Classification Accuracy Across Scenario Pairs. When fine-tuned on different scenario pairs, BERT varies in how well it distinguishes between the two scenarios. For instance, BERT achieves 94% accuracy when distinguishing between science magazines and shopping, but only 83% accuracy for travel and social media.
  • Figure 3: Example of Context Dependency in Answer Generation. In this example, the question explicitly asks for content that is not in the description. While the human-elicited answers do not repeat information in the description, IDEFICS (contextual) provides an answer but repeats content that is already in the description.
  • Figure 4: SBert Cosine Similarity Analysis in Human and Model Responses. Significance levels are marked with asterisks based on a two-sample t-test analysis.
  • Figure 5: CommVQA Dataset Examples. Four example entries from the CommVQA dataset, each paired with a randomly selected answer.
  • ...and 1 more figure
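The Figure 4 caption mentions cosine similarity over sentence embeddings and a two-sample t-test. The sketch below shows both computations in plain Python under stated assumptions: the embeddings are taken to be already-computed lists of floats (no encoder is loaded here), and the unequal-variance (Welch) form of the t statistic is an assumption, since the caption does not say which variant was used. The function names are illustrative.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # raises ZeroDivisionError for a zero vector


def welch_t_statistic(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Two-sample t statistic without assuming equal variances (Welch form).

    Each sample needs at least two values so the sample variance is defined.
    """
    standard_error = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / standard_error
```

In use, one would compute a similarity score per response for each group (for example, human responses and model responses) and pass the two lists of scores to `welch_t_statistic`; converting the statistic to a p-value requires the t distribution, which a statistics library would normally supply.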