Measuring text summarization factuality using atomic facts entailment metrics in the context of retrieval augmented generation

N. E. Kriman

TL;DR

This work tackles the challenge of measuring factuality in summaries produced by retrieval-augmented generation (RAG) systems by proposing a Naive Bayes-based framework built on atomic-fact comparisons. The method first decomposes the source and summary texts into atomic facts, classifies the relationship between them using a predefined set of factuality categories, and then feeds next-token probabilities from a fine-tuned LLM into a Naive Bayes classifier that predicts overall factuality. Evaluation on the AggreFact benchmark reveals correlations among the factuality categories, limited separability of the categories under PCA, and a clear need for multi-hop reasoning to assess fidelity accurately. The paper outlines future directions, including enhanced named entity recognition, multi-hop QA capabilities, and the use of pre-trained entailment models to improve atomic-fact comparison and factuality estimation.

Abstract

The use of large language models (LLMs) has significantly increased since the introduction of ChatGPT in 2022, demonstrating their value across various applications. However, a major challenge for enterprise and commercial adoption of LLMs is their tendency to generate inaccurate information, a phenomenon known as "hallucination." This project proposes a method for estimating the factuality of a summary generated by LLMs when compared to a source text. Our approach utilizes Naive Bayes classification to assess the accuracy of the content produced.
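
To make the classification step concrete, the following is a minimal sketch of how per-category probabilities for atomic facts might be aggregated and passed to a Naive Bayes classifier. It is not the paper's implementation. The category names, the averaging step, the Gaussian likelihood, and the toy training data are all assumptions made for illustration; in the described method the probabilities would come from a fine-tuned LLM's next-token distribution.

```python
import math
from collections import defaultdict

# Hypothetical category names (assumption): the paper's actual category set
# is not listed in this section.
CATEGORIES = ["supported", "contradicted", "unverifiable"]


def summarize_features(fact_probs):
    """Average the per-category probabilities over all atomic facts of one summary.

    Averaging is an illustrative aggregation choice, not the paper's stated one.
    """
    n = len(fact_probs)
    return [sum(p[c] for p in fact_probs) / n for c in CATEGORIES]


class GaussianNaiveBayes:
    """Small Gaussian Naive Bayes: features are treated as independent given the label."""

    def fit(self, X, y):
        by_label = defaultdict(list)
        for row, label in zip(X, y):
            by_label[label].append(row)
        self.priors, self.stats = {}, {}
        for label, rows in by_label.items():
            self.priors[label] = len(rows) / len(X)
            columns = list(zip(*rows))
            means = [sum(col) / len(col) for col in columns]
            # A small constant keeps variances away from zero on tiny datasets.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-3
                for col, m in zip(columns, means)
            ]
            self.stats[label] = list(zip(means, variances))
        return self

    def log_posterior(self, row):
        """Unnormalized log posterior: log prior plus summed Gaussian log likelihoods."""
        scores = {}
        for label, prior in self.priors.items():
            score = math.log(prior)
            for x, (mean, var) in zip(row, self.stats[label]):
                score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_posterior(row)
        return max(scores, key=scores.get)


# Toy training data (fabricated for illustration only): one feature vector per summary.
train_X = [
    [0.90, 0.05, 0.05], [0.85, 0.05, 0.10], [0.80, 0.10, 0.10],
    [0.30, 0.60, 0.10], [0.40, 0.40, 0.20], [0.20, 0.70, 0.10],
]
train_y = ["factual"] * 3 + ["non_factual"] * 3
model = GaussianNaiveBayes().fit(train_X, train_y)

# Stand-in for LLM next-token probabilities on two atomic facts of a new summary.
summary_fact_probs = [
    {"supported": 0.92, "contradicted": 0.03, "unverifiable": 0.05},
    {"supported": 0.88, "contradicted": 0.07, "unverifiable": 0.05},
]
features = summarize_features(summary_fact_probs)
prediction = model.predict(features)
print(prediction)
```

The independence assumption built into Naive Bayes is worth keeping in mind here: if the factuality categories are correlated, as the evaluation reports, the per-feature likelihoods double-count evidence and the resulting posteriors can be overconfident.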

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An overview of Retrieval Augmented Generation (Source: Gao et al., 2024)
  • Figure 2: A review of traditional summarization metrics (Source: Saadany and Orasan, 2021)
  • Figure 3: Factuality categories evaluation correlation
  • Figure 4: PCA representation of factuality categories evaluation