Generating Faithful and Salient Text from Multimodal Data

Tahsina Hashem; Weiqing Wang; Derry Tanti Wijaya; Mohammed Eunus Ali; Yuan-Fang Li

TL;DR

This paper develops a framework for generating faithful and salient text from mixed-modal data (images and structured data) and trains a vision critic model to identify hallucinated and non-salient features from the image modality.

Abstract

While large multimodal models (LMMs) have achieved strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance on detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented in knowledge graphs or tables). Specifically, we train a small vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post-editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs' generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination.
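The generate, critique, and post-edit loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`CriticFeedback`, `generate_with_critic`, and the four callables) is hypothetical, and the models are passed in as plain functions so that the control flow is visible without assuming any particular LMM or critic API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CriticFeedback:
    """Hypothetical container for what the vision critic reports."""
    hallucinated: List[str]  # features in the draft that the image does not support
    non_salient: List[str]   # features that are present but not worth mentioning
    salient: List[str]       # salient image features listed by the critic


def generate_with_critic(
    image: object,
    structured_data: dict,
    generate: Callable[[object, dict], str],
    extract_features: Callable[[str], List[str]],
    critique: Callable[[object, List[str]], CriticFeedback],
    post_edit: Callable[[str, CriticFeedback], str],
) -> str:
    """Run one pass of draft generation, critique, and post-hoc editing.

    All four callables are assumptions standing in for real models:
    `generate` and `post_edit` for the LMM, `extract_features` for an
    LLM that pulls image features out of the draft, and `critique` for
    the trained vision critic.
    """
    # 1) The LMM drafts text from the image and the structured data.
    draft = generate(image, structured_data)
    # 2) Image features mentioned in the draft are extracted.
    features = extract_features(draft)
    # 3) The critic labels those features and lists salient ones.
    feedback = critique(image, features)
    # 4) The draft is revised using the critic's feedback.
    return post_edit(draft, feedback)
```

Passing the models as callables is a design choice made for this sketch only; it keeps the example self-contained and lets each stage be swapped for a stub when testing.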

Paper Structure (26 sections, 2 equations, 14 figures, 9 tables)

Introduction
Related Work
Multimodal Data to Text generation
Hallucination in LMMs
Hallucination Mitigation of LMMs
Method
Problem Formulation
Training a Small Vision Language Model
Classifying Image Feature
Listing Salient Image Features
Training Data Generation
Post-hoc Text Editing from the Feedback given by the Critic Model
Experiments
Dataset
Baseline Models
...and 11 more sections

Figures (14)

Figure 1: A Sample Input and Output of an LMM: MiniGPT4. The Output Analysis lists the errors.
Figure 2: The Pipeline of our Framework for Salient and Faithful Multimodal Data to Text Generation: 1) the LMM generates text; 2) GPT-3.5 extracts image features from the text; 3) the trained vision critic model gives feedback to the LMM; 4) the LMM updates the text by making corrections.
Figure 3: Prompt Template for LMM to generate key features of the image for House dataset
Figure 4: Prompt Template for LMM to generate text for House dataset
Figure 5: Prompt Template for LLM to extract list of features from a sentence
...and 9 more figures
