Fully Authentic Visual Question Answering Dataset from Online Communities

Chongyan Chen, Mengchen Liu, Noel Codella, Yunsheng Li, Lu Yuan, Danna Gurari

TL;DR

This work introduces VQAonline, the first fully authentic Visual Question Answering dataset whose entire content—questions, context, images, and accepted answers—originates from real use on Stack Exchange. The dataset emphasizes long-form, paragraph-style answers (mean ~173 words) with rich contextual data, challenging standard VQA metrics designed for brief responses. The authors benchmark six modern Vision-Language Models using long-form evaluation metrics and a LLaMA2-based human-alignment metric, revealing substantial room for improvement and that context and topic significantly influence performance. They also conduct a rigorous human evaluation with domain experts to assess model outputs and the alignment of automatic metrics with human judgments, finding strong correlations for METEOR and BERTScore and varying alignment for image-based metrics. The work provides extensive supplementary materials and discusses benefits and limitations of incorporating user-intention signals, aiming to guide future research in authentic, context-rich VQA and evaluation methodology.

Abstract

Visual Question Answering (VQA) entails answering questions about images. We introduce the first VQA dataset in which all contents originate from an authentic use case. Because it is sourced from online question-answering community forums, we call it VQAonline. We characterize this dataset and how it relates to eight mainstream VQA datasets. Because answers in our dataset tend to be much longer (a mean of 173 words) and are therefore incompatible with standard VQA evaluation metrics, we instead use popular metrics for longer-text evaluation to evaluate six state-of-the-art VQA models on VQAonline and report where they struggle most. Finally, we analyze which evaluation metrics align best with human judgments. To facilitate future extensions, we publicly share the dataset at: https://vqaonline.github.io/.
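As a rough illustration of what checking a metric's alignment with human judgments can look like, the sketch below computes a Spearman rank correlation between human ratings and an automatic metric's scores for the same set of model answers. The choice of Spearman correlation, the function names, and the numbers are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one human rating and one metric score per model answer.
human_ratings = [1, 3, 2, 5, 4]
metric_scores = [0.10, 0.35, 0.20, 0.60, 0.40]

print(spearman(human_ratings, metric_scores))
```

A value near 1 means the metric orders the answers much as the human raters do; a value near 0 means the two orderings are unrelated.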

Paper Structure (63 sections, 12 figures, 19 tables)

Figures (12)

  • Figure 1: VQA examples from our VQAonline dataset and three mainstream VQA datasets [balanced_vqa_v2, singh2019towards, saikh2022scienceqa]. VQAonline is the first VQA dataset to originate from an authentic use case end to end, including authentic context, answers, and topic/category labels. It is also the first VQA dataset from an online question answering platform. A critical distinction of VQAonline, which necessitates a new evaluation methodology, is its lengthy answers. (Q=question, C=context, A=answer).
  • Figure 2: Number of visual questions per topic (in log scale). The colors, as defined in the legend, indicate the super-category for each topic.
  • Figure 3: Examples of three visual questions with three different user intents.
  • Figure 4: Performance of the two top-performing VQA models, mPLUG-Owl and LLaVA, for each of 105 topics, with their five super-categories shown in five different colors. Results are shown with respect to three evaluation metrics: (a) METEOR, (b) BERTScore, (c) RefCLIP. For visualization simplicity, we show text labels only for topics with interesting identified trends. (We omitted "language" from each language topic, such as "Esperanto" instead of "Esperanto Language", and we omitted topics with fewer than 10 data points, such as Esperanto and Community Building.)
  • Figure 5: Results from the top-performing model, mPLUG-Owl, on three examples in our dataset.
  • ...and 7 more figures