Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization
Yue Zhang, Jingxuan Zuo, Ke Su, Liqiang Jing
TL;DR
This work addresses factuality in multimodal summarization (text plus image) by introducing FALLACIOUS, a pair of fine-grained, explainable evaluation frameworks: one for the reference-based setting and one for the reference-free setting. The authors formalize the problem and propose QA- and VQA-driven pipelines that combine question generation, answer generation, and score aggregation to yield explicit factuality scores. In experiments on MMS and CEPSUM using GPT-4 and BLIP-2, they show that FALLACIOUS correlates more strongly with human judgments than existing metrics and remains robust across datasets and settings. The study also discusses the limitations of prior metrics and demonstrates transferability; the authors plan to release code and datasets to support reproducibility and broader adoption.
Abstract
Multimodal summarization aims to generate a concise summary from the input text and image. However, existing methods can produce factually inconsistent output. To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks (FALLACIOUS) for different application scenarios: a reference-based factuality evaluation framework and a reference-free factuality evaluation framework. Notably, the reference-free framework does not require ground truth and therefore applies to a wider range of scenarios. To evaluate the effectiveness of the proposed frameworks, we compute the correlation between our frameworks and the other metrics. The experimental results demonstrate the effectiveness of our proposed method. We will release our code and dataset via GitHub.
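
As a rough illustration of how a QA-driven factuality score can be both fine-grained and explainable, the sketch below aggregates per-question agreement into a single score and reports the questions on which the answers disagree. It is not the authors' implementation: the names `QAPair`, `normalize`, and `factuality_score`, the exact-match comparison, and the unweighted mean are all assumptions made for illustration. The question and answer generation steps (performed with GPT-4 and BLIP-2 in the paper) are assumed to have already produced the answers passed in here.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One generated question with two answers to compare (hypothetical structure)."""
    question: str
    summary_answer: str  # answer obtained from the candidate summary
    source_answer: str   # answer obtained from the reference or the source text/image


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(answer.lower().split())


def factuality_score(pairs: list[QAPair]) -> tuple[float, list[str]]:
    """Return (fraction of agreeing answers, questions whose answers disagree).

    Assumption: exact match after normalization and an unweighted mean.
    The actual frameworks may compare and aggregate answers differently.
    """
    if not pairs:
        raise ValueError("at least one question is required")
    mismatched = [
        p.question
        for p in pairs
        if normalize(p.summary_answer) != normalize(p.source_answer)
    ]
    score = 1.0 - len(mismatched) / len(pairs)
    return score, mismatched


if __name__ == "__main__":
    pairs = [
        QAPair("What color is the bag?", "Red", "red"),
        QAPair("What is the bag made of?", "leather", "canvas"),
    ]
    score, mismatched = factuality_score(pairs)
    print(score, mismatched)
```

The list of mismatched questions is what makes the score explainable in this sketch: each disagreement points to a specific claim in the summary that the source does not support.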
