MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning

Wanqing Cui, Keping Bi, Jiafeng Guo, Xueqi Cheng

TL;DR

This work proposes MORE, a novel Multi-mOdal REtrieval augmentation framework that leverages both text and images to enhance the commonsense ability of language models.

Abstract

Because commonsense information is recorded far less frequently than it actually occurs, language models pre-trained by text generation have difficulty learning sufficient commonsense knowledge. Several studies have leveraged text retrieval to augment the models' commonsense ability. Unlike text, images capture commonsense information inherently, but little effort has been devoted to utilizing them effectively. In this work, we propose a novel Multi-mOdal REtrieval (MORE) augmentation framework that leverages both text and images to enhance the commonsense ability of language models. Extensive experiments on the Common-Gen task demonstrate the efficacy of MORE with pre-trained models of both single and multiple modalities.

Paper Structure (29 sections, 4 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Sentences made by GPT3.5, GPT-4, and MORE given some concept words.
  • Figure 2: The process of our framework generating the sentence given input concepts based on multi-modal retrieval augmentation.
  • Figure 3: The SPICE values with respect to the number of retrieved items.
  • Figure 4: Test result of the baseline model, MORE augmented with relevant content, and MORE augmented with irrelevant content.
  • Figure 5: Generated sentences that benefit from retrieval augmentation.
  • ...and 3 more figures