VISREAS: Complex Visual Reasoning with Unanswerable Questions

Syeda Nahida Akter, Sangwu Lee, Yingshan Chang, Yonatan Bisk, Eric Nyberg

TL;DR

VISREAS is a new compositional visual question-answering dataset of answerable and unanswerable visual queries, formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. LOGIC2VISION is a new modular baseline that reasons by producing and executing pseudocode, without any external modules, to generate the answer.

Abstract

Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should identify the discrepancies in the query and convey them to the user rather than generating the best possible answer. To address this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, that consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VISREAS contains 2.07M semantically diverse queries generated automatically using Visual Genome scene graphs. The unique feature of this task (validating question answerability with respect to an image before answering) and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION, which reasons by producing and executing pseudocode without any external modules to generate the answer. LOGIC2VISION outperforms generative models on VISREAS (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant performance gain over classification models.
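
The idea of validating a question against the image before answering can be illustrated with a minimal sketch. It assumes a toy scene graph stored as a dictionary and three hypothetical step functions (`select`, `relate`, `query_attr`); none of these names or data come from the paper, and this is not the LOGIC2VISION implementation, which has the model itself produce and execute pseudocode. Each step returns `None` when the referenced object, relation, or attribute is absent, and the question is then reported as unanswerable.

```python
from typing import List, Optional

# Toy scene graph (hypothetical data, loosely in the spirit of a scene graph).
SCENE = {
    "objects": {
        "o1": {"name": "cup", "attributes": {"color": "red"}},
        "o2": {"name": "table", "attributes": {"material": "wood"}},
    },
    # Each relation is (subject id, relation name, object id).
    "relations": [("o1", "on", "o2")],
}


def select(scene: dict, name: str) -> Optional[List[str]]:
    """Return ids of objects with the given name, or None if there are none."""
    ids = [oid for oid, obj in scene["objects"].items() if obj["name"] == name]
    return ids or None


def relate(scene: dict, subject_ids: Optional[List[str]],
           relation: str, target_name: str) -> Optional[List[str]]:
    """Keep subjects that hold `relation` to an object named `target_name`."""
    if subject_ids is None:
        return None  # an earlier step already failed
    kept = [
        s for s in subject_ids
        if any(
            subj == s and rel == relation
            and scene["objects"][obj]["name"] == target_name
            for subj, rel, obj in scene["relations"]
        )
    ]
    return kept or None


def query_attr(scene: dict, ids: Optional[List[str]],
               attribute: str) -> Optional[str]:
    """Return the attribute value of the first matching object, or None."""
    if ids is None:
        return None
    for oid in ids:
        value = scene["objects"][oid]["attributes"].get(attribute)
        if value is not None:
            return value
    return None


def answer(scene: dict, subject: str, relation: str,
           target: str, attribute: str) -> str:
    """Run the reasoning steps; report 'unanswerable' if any step fails."""
    ids = relate(scene, select(scene, subject), relation, target)
    value = query_attr(scene, ids, attribute)
    return value if value is not None else "unanswerable"
```

For example, `answer(SCENE, "cup", "on", "table", "color")` follows every step successfully, while asking about a cup *under* the table, or about an object that is not in the scene, fails at an intermediate step and yields `"unanswerable"`.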

Paper Structure (34 sections, 14 figures, 9 tables)


Figures (14)

  • Figure 1: Overview of the VisReas dataset construction process. Using scene graphs, we cluster objects (orange), relations, and attributes of the related objects (blue) based on the attribute of the corresponding objects (orange). The question engine then takes each template as input and traverses all possible clusters to generate the query as well as the reasoning steps. Each function in the reasoning steps can return NONE if any object, attribute, or relation is absent in the image.
  • Figure 2: Overview of VisReas statistics. (Top left) The dataset covers 14 attributes in a balanced ratio. (Top right) It consists of five reasoning types of queries in a balanced distribution. (Bottom left) Comparison of multi-hop relation traversal across VQA datasets. Unlike in the other datasets, the majority of VisReas questions require multi-hop traversal. (Bottom right) Comparison of the number of objects mentioned in the question across datasets; VisReas questions mention a larger number of objects.
  • Figure 3: VisReas contains two types of relation traversals. In a star relation, a single object shares multiple relations with other objects (left). In a chain relation, multiple objects share a single relation with each other (right).
  • Figure 4: Overview of Logic2Vision. In the Pseudocode Generation phase, we generate pseudocode that outlines the reasoning steps. During Pseudocode-Guided Reasoning, the pseudocode is provided to the model along with the question and image. The model executes all intermediate pseudocode steps to arrive at the final answer.
  • Figure 5: Overview of types of questions along with some templates and examples from the VisReas corpus.
  • ...and 9 more figures
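
The star and chain traversals named in the Figure 3 caption can be made concrete with a small sketch. It reflects one reading of the caption only: a star is taken to mean that every relation starts from the same object, and a chain that each relation's object is the subject of the next relation. The function name `traversal_type`, the triple format, and the example relations are all hypothetical and do not come from the paper.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def traversal_type(triples: List[Triple]) -> str:
    """Classify an ordered list of relation triples.

    star:  every triple has the same subject
    chain: each triple's object is the next triple's subject
    """
    if len(triples) < 2:
        return "single"  # fewer than two relations, so no multi-hop pattern
    subjects = {subject for subject, _, _ in triples}
    if len(subjects) == 1:
        return "star"
    linked = all(
        triples[i][2] == triples[i + 1][0] for i in range(len(triples) - 1)
    )
    return "chain" if linked else "other"


star = [("man", "holding", "cup"), ("man", "wearing", "hat")]
chain = [("cup", "on", "table"), ("table", "near", "window")]
```

Here `star` has one object ("man") related to several others, while `chain` links "cup" to "window" through "table".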