Generating Faithful and Salient Text from Multimodal Data

Tahsina Hashem, Weiqing Wang, Derry Tanti Wijaya, Mohammed Eunus Ali, Yuan-Fang Li

TL;DR

This paper develops a framework for generating faithful and salient text from mixed-modal data (images and structured data), and trains a vision critic model to identify hallucinated and non-salient features from the image modality.

Abstract

While large multimodal models (LMMs) have obtained strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance on detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented in knowledge graphs or tables). Specifically, we train a small vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post-editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs' generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination.
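
The generate, extract, critique, post-edit loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`CriticFeedback`, `critique`, `toy_extract`, `toy_post_edit`, `generate_with_critic`, `run_demo`) is hypothetical, the critic is replaced by plain set lookups rather than a trained vision model, feature extraction is replaced by substring matching rather than an LLM call, and the house-listing features are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class CriticFeedback:
    hallucinated: List[str]  # mentioned in the draft but not supported by the image
    non_salient: List[str]   # supported by the image but not worth mentioning
    salient: List[str]       # features the critic considers worth mentioning


def critique(mentioned: List[str], visible: Set[str], salient: Set[str]) -> CriticFeedback:
    """Toy stand-in for a trained vision critic: set lookups instead of a model."""
    return CriticFeedback(
        hallucinated=[f for f in mentioned if f not in visible],
        non_salient=[f for f in mentioned if f in visible and f not in salient],
        salient=sorted(salient),
    )


def toy_extract(text: str, vocabulary: List[str]) -> List[str]:
    """Toy stand-in for LLM-based feature extraction: substring matching."""
    lowered = text.lower()
    return [f for f in vocabulary if f in lowered]


def toy_post_edit(draft: str, feedback: CriticFeedback) -> str:
    """Toy stand-in for post-editing: drop sentences that mention flagged features."""
    flagged = feedback.hallucinated + feedback.non_salient
    sentences = [s.strip() for s in draft.split(".") if s.strip()]
    kept = [s for s in sentences if not any(f in s.lower() for f in flagged)]
    return ". ".join(kept) + "." if kept else ""


def generate_with_critic(
    generate: Callable[[], str],
    extract_features: Callable[[str], List[str]],
    critic: Callable[[List[str]], CriticFeedback],
    post_edit: Callable[[str, CriticFeedback], str],
) -> str:
    draft = generate()                   # 1) the generator drafts text
    mentioned = extract_features(draft)  # 2) image features mentioned in the draft
    feedback = critic(mentioned)         # 3) the critic flags problems
    return post_edit(draft, feedback)    # 4) the draft is revised using the feedback


def run_demo() -> str:
    # Invented example data: what the image shows and which parts are salient.
    visible = {"hardwood floor", "island bench", "ceiling fan"}
    salient = {"hardwood floor", "island bench"}
    vocabulary = ["hardwood floor", "island bench", "ceiling fan", "swimming pool"]
    draft = (
        "The kitchen has a hardwood floor and an island bench. "
        "There is a ceiling fan. The backyard has a swimming pool."
    )
    return generate_with_critic(
        generate=lambda: draft,
        extract_features=lambda text: toy_extract(text, vocabulary),
        critic=lambda mentioned: critique(mentioned, visible, salient),
        post_edit=toy_post_edit,
    )


if __name__ == "__main__":
    print(run_demo())
```

In this toy run the "swimming pool" sentence is dropped as hallucinated and the "ceiling fan" sentence as non-salient. Passing each stage in as a callable keeps the control flow separate from the models, so a real generator, extractor, and critic could be swapped in without changing `generate_with_critic`.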

Paper Structure

This paper contains 26 sections, 2 equations, 14 figures, and 9 tables.

Figures (14)

  • Figure 1: A Sample Input and Output of an LMM: MiniGPT4. The Output Analysis lists the errors.
  • Figure 2: The Pipeline of our Framework for Salient and Faithful Multimodal Data-to-Text Generation: 1) the LMM generates text; 2) GPT-3.5 extracts image features from the text; 3) the trained vision critic model gives feedback to the LMM; 4) the LMM updates the text by making corrections.
  • Figure 3: Prompt Template for the LMM to generate key features of the image for the House dataset
  • Figure 4: Prompt Template for the LMM to generate text for the House dataset
  • Figure 5: Prompt Template for the LLM to extract a list of features from a sentence
  • ...and 9 more figures