Enhancing Zero-shot Commonsense Reasoning by Integrating Visual Knowledge via Machine Imagination

Hyuntae Park; Yeachan Kim; SangKeun Lee

TL;DR

This work proposes Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework that supplements textual inputs with visual signals from machine-generated images. Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models.

Abstract

Recent advancements in zero-shot commonsense reasoning have empowered Pre-trained Language Models (PLMs) to acquire extensive commonsense knowledge without requiring task-specific fine-tuning. Despite this progress, these models frequently suffer from limitations caused by human reporting biases inherent in textual knowledge, leading to discrepancies in understanding between machines and humans. To bridge this gap, we introduce an additional modality to enrich the reasoning capabilities of PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework that supplements textual inputs with visual signals from machine-generated images. Specifically, we enhance PLMs with the ability to imagine by embedding an image generator directly into the reasoning pipeline. To facilitate effective utilization of this imagined visual context, we construct synthetic datasets designed to emulate visual question-answering scenarios. Through comprehensive evaluations on multiple commonsense reasoning benchmarks, we demonstrate that Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models. These results underscore the capability of machine imagination to mitigate reporting bias and significantly enhance the generalization ability of commonsense reasoning models.
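The pipeline described above (imagine an image from the question, then combine textual and visual evidence to pick an answer) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `weight` parameter, and the simple weighted average of per-answer probabilities are all assumptions made for illustration, and the toy scorers stand in for a real image generator, PLM, and vision-language model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_imagination(
    question: str,
    answers: Sequence[str],
    imagine: Callable[[str], object],
    text_scorer: Callable[[str, str], float],
    visual_scorer: Callable[[object, str, str], float],
    weight: float = 0.5,
) -> Tuple[int, List[float]]:
    """Hypothetical zero-shot multiple-choice prediction.

    `imagine` stands in for an image generator conditioned on the question.
    `text_scorer` and `visual_scorer` stand in for text-only and
    image-conditioned models. The weighted average below is an assumed
    fusion rule, not one taken from the paper.
    """
    image = imagine(question)  # "machine imagination" step
    text_probs = softmax([text_scorer(question, a) for a in answers])
    visual_probs = softmax([visual_scorer(image, question, a) for a in answers])
    combined = [
        (1.0 - weight) * t + weight * v for t, v in zip(text_probs, visual_probs)
    ]
    best = max(range(len(answers)), key=combined.__getitem__)
    return best, combined


# Toy stand-ins (purely illustrative; no real models are involved).
TOY_TEXT_SCORES = {"use a spoon": 0.0, "use a hammer": 1.0}
TOY_VISUAL_SCORES = {"use a spoon": 3.0, "use a hammer": 0.0}


def toy_imagine(question: str) -> str:
    return f"<image for: {question}>"


def toy_text_scorer(question: str, answer: str) -> float:
    return TOY_TEXT_SCORES[answer]


def toy_visual_scorer(image: object, question: str, answer: str) -> float:
    return TOY_VISUAL_SCORES[answer]


if __name__ == "__main__":
    options = ["use a spoon", "use a hammer"]
    idx, probs = predict_with_imagination(
        "How do you stir soup?", options,
        toy_imagine, toy_text_scorer, toy_visual_scorer,
    )
    print(options[idx], probs)
```

In this toy setup the text-only scorer prefers the wrong option, and the image-conditioned scorer corrects it once the two distributions are mixed. Setting `weight=0.0` recovers the text-only prediction, which makes the contribution of the visual signal easy to inspect.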

Paper Structure (38 sections, 9 equations, 9 figures, 16 tables)

Introduction
Related Work
Zero-shot Commonsense Reasoning
Incorporating Visual Signals for Natural Language Understanding
Vision-Language Models and Commonsense Knowledge Bases
Machine Imagination-based Reasoning
Machine Imagination in PLMs
Synthetic VQA & Synthetic VQA+ Construction
Synthetic VQA
Synthetic VQA+
Pre-training Imagine on Synthetic VQA
Inference from Imagine
Faster Inference via Image Retrieval
Experiments
Experimental Setup
...and 23 more sections

Figures (9)

Figure 1: An example from the PIQA dataset (Bisk et al., 2020) with model predictions. Imagine performs reasoning by leveraging machine-generated images to enhance understanding of the question.
Figure 2: Overall procedures for (a) constructing a Synthetic VQA dataset and (b) the inference/optimization phase of Imagine (ours) using the given QA pair. The process starts with the textual pair consisting of a question and its answers, followed by the generation of visual signals (i.e., imagination) conditioned on the question. The two distinct features from visual and textual models are then utilized to derive a comprehensive prediction.
Figure 3: Examples of the Synthetic VQA+ dataset. Our dataset is sourced from AbstractATOMIC (Wang et al., 2023), VCR (Zellers et al., 2019), and Sherlock (Hessel et al., 2022). Bold indicates the correct answer, and underline denotes the generated image caption.
Figure 4: Examples of less plausible data filtered during the construction of Synthetic VQA+. We measured the commonsense plausibility of (question, correct answer) pairs using the VERA model (Liu et al., 2023). Bold indicates the correct answer.
Figure 5: Examples of generated and retrieved images based on the input question. The first row shows cases where retrieved images are helpful for the inference, while the second row shows cases where retrieved images are not helpful.
...and 4 more figures
