Yin and Yang: Balancing and Answering Binary Visual Questions

Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, Devi Parikh

TL;DR

The paper tackles binary yes/no visual question answering by reframing it as a visual verification task on abstract scenes, thereby mitigating language priors. It introduces a two-stage pipeline that converts questions into concise <P,R,S> tuples and conducts attention-based visual verification to determine whether the concept is present. A key contribution is balancing the VQA dataset with complementary scenes, which reduces linguistic bias and makes visual reasoning necessary. The approach, aided by region-focused attention and robust tuple alignment, outperforms language-only baselines and prior VQA methods on the balanced dataset, and the authors provide qualitative and ablation evidence supporting the design. The work also discusses generalization challenges and releases a balanced abstract-scene dataset to spur future research in grounded visual reasoning.

Abstract

The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content. This can hinder progress in pushing the state of the art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of the concepts inquired about in the questions. Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image. If the concept can be found in the image, the answer to the question is "yes", and otherwise "no". Abstract scenes play two roles: (1) they allow us to focus on the high-level semantics of the VQA task as opposed to the low-level recognition problems, and perhaps more importantly, (2) they provide us the modality to balance the dataset such that language priors are controlled, and the role of vision is essential. In particular, we collect fine-grained pairs of scenes for every question, such that the answer to the question is "yes" for one scene, and "no" for the other for the exact same question. Indeed, language priors alone do not perform better than chance on our balanced dataset. Moreover, our proposed approach matches the performance of a state-of-the-art VQA approach on the unbalanced dataset, and outperforms it on the balanced dataset.
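
The claim that language priors cannot beat chance on a balanced dataset can be illustrated with a minimal sketch. The toy questions, scene identifiers, and the helper `question_only_accuracy` below are hypothetical and are not taken from the paper or its data. The sketch only shows that when every question is paired with one "yes" scene and one "no" scene, the best predictor that ignores the scene is right exactly half the time.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (question, scene_id, answer).
# In the balanced set, every question has two complementary scenes
# with opposite answers.
balanced = [
    ("Is the dog sitting on the sofa?", "scene_1a", "yes"),
    ("Is the dog sitting on the sofa?", "scene_1b", "no"),
    ("Is the boy holding a ball?", "scene_2a", "yes"),
    ("Is the boy holding a ball?", "scene_2b", "no"),
]

# In the unbalanced set, one question is answered "yes" in both scenes,
# so the question text alone carries information about the answer.
unbalanced = [
    ("Is the dog sitting on the sofa?", "scene_1a", "yes"),
    ("Is the dog sitting on the sofa?", "scene_3", "yes"),
    ("Is the boy holding a ball?", "scene_2a", "yes"),
    ("Is the boy holding a ball?", "scene_2b", "no"),
]


def question_only_accuracy(examples):
    """Best accuracy reachable on `examples` by a predictor that sees
    only the question text (it picks the majority answer per question)."""
    answers_by_question = defaultdict(Counter)
    for question, _scene, answer in examples:
        answers_by_question[question][answer] += 1
    correct = sum(max(counts.values()) for counts in answers_by_question.values())
    return correct / len(examples)


print(question_only_accuracy(balanced))
print(question_only_accuracy(unbalanced))
```

On the balanced toy set the question-only bound is 0.5, while on the unbalanced toy set it rises to 0.75, which is the gap that complementary scene pairs are meant to close.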

Paper Structure

This paper contains 30 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: We address the problem of answering binary questions about images. To eliminate strong language priors that shadow the role of detailed visual understanding in visual question answering (VQA), we use abstract scenes to collect a balanced dataset containing pairs of complementary scenes: the two scenes have opposite answers to the same question, while being visually as similar as possible. We view the task of answering binary questions as a visual verification task: we convert the question into a tuple that concisely summarizes the visual concept, which, if present, results in the answer to the question being "yes", and otherwise "no". Our approach attends to relevant portions of the image when verifying the presence of the visual concept.
  • Figure 2: A snapshot of our Amazon Mechanical Turk (AMT) interface to collect complementary scenes.
  • Figure 3: Most plausible words for an object determined using mutual information.
  • Figure 4: Qualitative results of our approach. We show input questions, complementary scenes that are subtle (semantic) perturbations of each other, along with tuples extracted by our approach, and objects in the scenes that our model chooses to attend to while answering the question. Primary object is shown in red and secondary object is in blue.
  • Figure 5: A subset of the clipart objects present in the abstract scenes library.
  • ...and 5 more figures