MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning

Wanqing Cui; Keping Bi; Jiafeng Guo; Xueqi Cheng

TL;DR

This work proposes a novel Multi-mOdal REtrieval (MORE) augmentation framework that leverages both text and images to enhance the commonsense ability of language models.

Abstract

Because commonsense information is written down far less often than it occurs in the world, language models pre-trained on text generation have difficulty learning sufficient commonsense knowledge. Several studies have leveraged text retrieval to augment the models' commonsense ability. Unlike text, images capture commonsense information inherently, but little effort has been made to utilize them effectively. In this work, we propose a novel Multi-mOdal REtrieval (MORE) augmentation framework that leverages both text and images to enhance the commonsense ability of language models. Extensive experiments on the Common-Gen task demonstrate the efficacy of MORE with pre-trained models of both single and multiple modalities.

Paper Structure (29 sections, 4 equations, 8 figures, 6 tables)

This paper contains 29 sections, 4 equations, 8 figures, 6 tables.

Introduction
Related Work
Retrieval Augmented Generation
Image Enhanced Text Generation
Generative Commonsense Reasoning
Preliminaries
Multi-Modal Retrieval Augmentation
Retrieval Results for Augmentation
Multi-Modal Encoder
Retrieved Information Integrator
Soft Prompt Based Text Generation
Training Strategy
Experimental Settings
Dataset
Methods for Comparisons
...and 14 more sections

Figures (8)

Figure 1: Sentences made by GPT-3.5, GPT-4, and MORE given some concept words.
Figure 2: The process of our framework generating the sentence given input concepts based on multi-modal retrieval augmentation.
Figure 3: The SPICE values with respect to the number of retrieved items.
Figure 4: Test results of the baseline model, MORE augmented with relevant content, and MORE augmented with irrelevant content.
Figure 5: Generated sentences that benefit from retrieval augmentation.
...and 3 more figures
