Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization

Yue Zhang, Jingxuan Zuo, Ke Su, Liqiang Jing

TL;DR

This work addresses factuality in multimodal summarization (text plus image) by introducing FALLACIOUS, two fine-grained, explainable evaluation frameworks for reference-based and reference-free settings. The authors formalize the problem, proposing QA- and VQA-driven pipelines with question generation, answer generation, and score aggregation to yield explicit factuality scores. Through extensive experiments on MMS and CEPSUM, using GPT-4 and BLIP-2, they show FALLACIOUS correlates more strongly with human judgments than existing metrics and demonstrates robustness across datasets and settings. The study also discusses limitations of prior metrics and demonstrates transferability, with code and datasets to enable reproducibility and broader adoption.

Abstract

Multimodal summarization aims to generate a concise summary from an input text and image. However, existing methods can produce non-factual output. To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks (FALLACIOUS) for different application scenarios: a reference-based factuality evaluation framework and a reference-free factuality evaluation framework. Notably, the reference-free framework does not need ground truth and therefore applies to a wider range of scenarios. To evaluate the effectiveness of the proposed frameworks, we compute the correlation between our frameworks and the other metrics. The experimental results show the effectiveness of our proposed method. We will release our code and dataset via GitHub.

Paper Structure

This paper contains 22 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The proposed reference-based framework.
  • Figure 2: The proposed reference-free framework.
  • Figure 3: Examples of evaluation methods. QA denotes the answer based on textual modality and VQA denotes the answer based on visual modality. The final answer is the combination of QA and VQA. Score means the score of the proposed metric.
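
The Figure 3 caption describes combining a text-based answer (QA) with an image-based answer (VQA) into a final answer and then reporting a score. The Python sketch below illustrates one way such a score aggregation step could look. It is not the authors' implementation: the `QuestionRecord` type, the `normalize` and `is_supported` helpers, the exact-match comparison, and the "fraction of supported questions" aggregation are all assumptions made for illustration, and the answers are supplied by hand rather than produced by GPT-4 or BLIP-2.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuestionRecord:
    """One generated question with its candidate answers (hypothetical schema)."""
    question: str
    summary_answer: str        # answer implied by the summary under evaluation
    qa_answer: Optional[str]   # answer from the text modality, None if unanswerable
    vqa_answer: Optional[str]  # answer from the image modality, None if unanswerable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def is_supported(record: QuestionRecord) -> bool:
    """Assumed combination rule: the summary's answer counts as supported
    if either the text-based or the image-based answer agrees with it."""
    target = normalize(record.summary_answer)
    candidates = [a for a in (record.qa_answer, record.vqa_answer) if a is not None]
    return any(normalize(a) == target for a in candidates)


def factuality_score(records: List[QuestionRecord]) -> float:
    """Assumed aggregation: fraction of questions whose summary answer is supported."""
    if not records:
        return 0.0
    return sum(is_supported(r) for r in records) / len(records)


# Hand-written example data; in a real pipeline these would come from models.
records = [
    QuestionRecord("What color is the bag?", "Red", qa_answer=None, vqa_answer="red"),
    QuestionRecord("What material is it?", "leather", qa_answer="Leather", vqa_answer=None),
    QuestionRecord("How many pockets?", "three", qa_answer="two", vqa_answer="two"),
    QuestionRecord("Who is it for?", "women", qa_answer="women", vqa_answer=None),
]
print(factuality_score(records))
```

Keeping the per-question support decision separate from the aggregation makes the score explainable in the sense the paper emphasizes: each unsupported question points to a specific claim in the summary. A real system would likely replace exact matching with a softer answer-comparison step.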