Enhancing Zero-shot Commonsense Reasoning by Integrating Visual Knowledge via Machine Imagination

Hyuntae Park, Yeachan Kim, SangKeun Lee

TL;DR

This work proposes Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework that supplements textual inputs with visual signals from machine-generated images. Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models.

Abstract

Recent advancements in zero-shot commonsense reasoning have empowered Pre-trained Language Models (PLMs) to acquire extensive commonsense knowledge without requiring task-specific fine-tuning. Despite this progress, these models frequently suffer from limitations caused by human reporting biases inherent in textual knowledge, leading to understanding discrepancies between machines and humans. To bridge this gap, we introduce an additional modality to enrich the reasoning capabilities of PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework that supplements textual inputs with visual signals from machine-generated images. Specifically, we enhance PLMs with the ability to imagine by embedding an image generator directly into the reasoning pipeline. To facilitate effective utilization of this imagined visual context, we construct synthetic datasets designed to emulate visual question-answering scenarios. Through comprehensive evaluations on multiple commonsense reasoning benchmarks, we demonstrate that Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models. These results underscore the capability of machine imagination to mitigate reporting bias and significantly enhance the generalization ability of commonsense reasoning models.

Paper Structure (38 sections, 9 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: An example from the PIQA dataset (Bisk et al., 2020) with model predictions. Imagine performs reasoning by leveraging machine-generated images to enhance understanding of the question.
  • Figure 2: Overall procedures for (a) constructing a Synthetic VQA dataset and (b) the inference/optimization phase of Imagine (ours) using the given QA pair. The process starts with the textual pair consisting of a question and its answers, followed by the generation of visual signals (i.e., imagination) conditioned on the question. The two distinct features from visual and textual models are then utilized to derive a comprehensive prediction.
  • Figure 3: Examples of the Synthetic VQA+ dataset. Our dataset is sourced from AbstractATOMIC (Wang et al., 2023), VCR (Zellers et al., 2019), and Sherlock (Hessel et al., 2022). Bold indicates the correct answer, and underline denotes the generated image caption.
  • Figure 4: Examples of less plausible data filtered during the construction of Synthetic VQA+. We measured the commonsense plausibility of (question, correct answer) pairs using the VERA model (Liu et al., 2023). Bold indicates the correct answer.
  • Figure 5: Examples of generated and retrieved images based on the input question. The first row shows cases where retrieved images are helpful for the inference, while the second row shows cases where retrieved images are not helpful.
  • ...and 4 more figures
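
The inference flow described in the Figure 2 caption (imagine an image from the question, score the answers with a textual and a visual model, then merge the two) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the softmax normalization, the weighted-average fusion, and the `visual_weight` parameter are all assumptions introduced here, and the toy scorers stand in for real models.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_imagination(question, choices, imagine, text_score,
                             visual_score, visual_weight=0.5):
    """Return the index of the best answer choice.

    All callables are hypothetical placeholders:
      imagine(question)                     -> an "image" object
      text_score(question, choice)          -> float, text-only score
      visual_score(image, question, choice) -> float, image-aware score
    The weighted average below is an assumed fusion rule.
    """
    # Step 1: generate the visual signal from the question alone.
    image = imagine(question)
    # Step 2: score every answer choice with each modality.
    text_probs = softmax([text_score(question, c) for c in choices])
    visual_probs = softmax([visual_score(image, question, c) for c in choices])
    # Step 3: merge the two distributions into one prediction.
    combined = [(1.0 - visual_weight) * t + visual_weight * v
                for t, v in zip(text_probs, visual_probs)]
    return max(range(len(choices)), key=lambda i: combined[i])


# Toy stand-ins so the sketch runs without any model.
def toy_imagine(question):
    return {"prompt": question}


def toy_text_score(question, choice):
    return 0.0  # the text-only model is uninformative here


def toy_visual_score(image, question, choice):
    return 1.0 if choice == "B" else 0.0  # the image favors choice "B"


best = predict_with_imagination("Which option fits?", ["A", "B"],
                                toy_imagine, toy_text_score, toy_visual_score)
print(best)
```

In this toy setup the text scorer cannot separate the two choices, so the visual scorer decides the outcome; setting `visual_weight=0.0` reduces the sketch to a text-only predictor.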