Measuring text summarization factuality using atomic facts entailment metrics in the context of retrieval augmented generation
N. E. Kriman
TL;DR
This work addresses the problem of measuring factuality in summaries produced by retrieval-augmented generation (RAG) systems by proposing a Naive Bayes-based framework built on atomic-fact comparisons. The method first decomposes the source and summary texts into atomic facts, then classifies the relationship between facts into a predefined set of factuality categories, and finally feeds the next-token probabilities that a fine-tuned LLM assigns to those categories into a Naive Bayes classifier that predicts overall factuality. Evaluation on the AggreFact benchmark shows that the factuality categories are correlated with one another, that the classes are only weakly separable under PCA, and that multi-hop reasoning is needed to assess fidelity accurately. The paper outlines future directions, including improved named entity recognition, multi-hop QA capabilities, and the use of pre-trained entailment models to improve atomic-fact comparison and factuality estimation.
Abstract
The use of large language models (LLMs) has increased significantly since the introduction of ChatGPT in 2022, demonstrating their value across a range of applications. However, a major obstacle to enterprise and commercial adoption of LLMs is their tendency to generate inaccurate information, a phenomenon known as "hallucination." This project proposes a method for estimating the factuality of an LLM-generated summary relative to its source text. Our approach uses Naive Bayes classification over atomic-fact comparisons to assess whether the summary is faithful to the source.
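To make the final classification step concrete, the following is a minimal sketch of how per-fact category probabilities could be aggregated and passed to a Naive Bayes classifier. It is an illustration under stated assumptions, not the implementation used in this work: the category names, the averaging step in `aggregate`, the choice of a Gaussian likelihood, and all numeric values are hypothetical and chosen only so that the example runs.

```python
import math
from collections import defaultdict

# Hypothetical category names: the actual factuality categories are not
# listed in this section, so these three are placeholders.
CATEGORIES = ("supported", "contradicted", "unverifiable")


def aggregate(fact_probs):
    """Average per-fact category probabilities into one feature vector.

    `fact_probs` is a non-empty list of dicts, one per compared atomic fact,
    mapping each category to the probability an LLM assigned to it.
    Averaging is an assumption; other aggregations are possible.
    """
    n = len(fact_probs)
    return [sum(p[c] for p in fact_probs) / n for c in CATEGORIES]


class GaussianNaiveBayes:
    """Naive Bayes with an independent Gaussian likelihood per feature."""

    def fit(self, X, y):
        by_label = defaultdict(list)
        for row, label in zip(X, y):
            by_label[label].append(row)
        self.stats = {}
        for label, rows in by_label.items():
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # Variance floor keeps densities finite on tiny toy datasets.
            variances = [
                max(sum((v - m) ** 2 for v in c) / len(c), 1e-3)
                for c, m in zip(cols, means)
            ]
            log_prior = math.log(len(rows) / len(X))
            self.stats[label] = (log_prior, means, variances)
        return self

    def log_joint(self, x):
        """Return log P(label) + sum_i log P(x_i | label) for each label."""
        scores = {}
        for label, (log_prior, means, variances) in self.stats.items():
            score = log_prior
            for v, m, var in zip(x, means, variances):
                score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            scores[label] = score
        return scores

    def predict(self, x):
        scores = self.log_joint(x)
        return max(scores, key=scores.get)


# Toy training data (made up for illustration, not results from the paper):
# each row is an aggregated [supported, contradicted, unverifiable] vector.
train_X = [
    [0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.85, 0.05, 0.10],
    [0.30, 0.60, 0.10], [0.40, 0.40, 0.20], [0.20, 0.50, 0.30],
]
train_y = ["factual"] * 3 + ["non-factual"] * 3

model = GaussianNaiveBayes().fit(train_X, train_y)

# A summary whose two atomic facts are both mostly supported by the source.
summary_facts = [
    {"supported": 0.90, "contradicted": 0.05, "unverifiable": 0.05},
    {"supported": 0.80, "contradicted": 0.10, "unverifiable": 0.10},
]
prediction = model.predict(aggregate(summary_facts))
```

The conditional-independence assumption in `log_joint` is worth noting in light of the observation that the factuality categories are correlated: correlated features are exactly the case in which Naive Bayes posteriors tend to be overconfident, so scores from a sketch like this are better read as a ranking signal than as calibrated probabilities.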
