Agent-as-Judge for Factual Summarization of Long Narratives

Yeonseok Jeong, Minsoo Kim, Seung-won Hwang, Byung-Hak Kim

TL;DR

This work tackles factuality gaps in long-narrative summarization by introducing NarrativeFactScore, an agent-based evaluation framework powered by a Consistent Character Knowledge Graph to assess and refine summaries. It decomposes summaries into atomic facts, retrieves scene evidence and CKG subgraphs, and provides actionable feedback guiding iterative refinement. Across STORYSUMM, FABLES, MENSA, and MovieSum, NarrativeFactScore shows strong correlation with human factuality and yields improvements in factuality and standard summarization metrics after refinement. The results demonstrate the potential of agent-driven evaluation systems to enhance the reliability of LLM-generated long-form narratives and suggest applicability beyond narratives to domains requiring deep character or relational reasoning.

Abstract

Large Language Models (LLMs) have demonstrated near-human performance in summarization tasks based on traditional metrics such as ROUGE and BERTScore. However, these metrics do not adequately capture critical aspects of summarization quality, such as factual accuracy, particularly for long narratives (>100K tokens). Recent advances, such as LLM-as-a-Judge, address the limitations of metrics based on lexical similarity but still exhibit factual inconsistencies, especially in understanding character relationships and states. In this work, we introduce NarrativeFactScore, a novel "Agent-as-a-Judge" framework for evaluating and refining summaries. By leveraging a Character Knowledge Graph (CKG) extracted from the input and the generated summaries, NarrativeFactScore assesses factual consistency and provides actionable guidance for refinement, such as identifying missing or erroneous facts. We demonstrate the effectiveness of NarrativeFactScore through a detailed workflow illustration and extensive validation on widely adopted benchmarks, achieving superior performance compared to competitive methods. Our results highlight the potential of agent-driven evaluation systems to improve the factual reliability of LLM-generated summaries.
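
The scoring idea described above can be illustrated with a minimal Python sketch. It assumes, as a simplification not stated in this excerpt, that the score is the fraction of atomic facts judged supported by the evidence; the `FactVerdict` type, the function names, and the placeholder facts are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class FactVerdict:
    """Hypothetical record for one atomic fact decomposed from a summary."""
    fact: str            # the atomic fact text
    supported: bool      # assumed judge verdict against scene and CKG evidence
    feedback: str = ""   # assumed explanation, filled in for unsupported facts


def fact_score(verdicts):
    """Fraction of atomic facts judged supported (an assumed definition)."""
    if not verdicts:
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


def collect_feedback(verdicts):
    """Gather feedback strings for unsupported facts to guide refinement."""
    return [v.feedback for v in verdicts if not v.supported and v.feedback]


# Illustrative verdicts only: three placeholder facts plus one erroneous claim.
verdicts = [
    FactVerdict("placeholder fact 1", True),
    FactVerdict("placeholder fact 2", True),
    FactVerdict("placeholder fact 3", True),
    FactVerdict(
        "Sauron is pursuing Gandalf.",
        False,
        "The evidence does not support this character relationship.",
    ),
]

score = fact_score(verdicts)
feedback = collect_feedback(verdicts)
```

Under this assumed definition, one unsupported fact out of four gives a score of 0.75, and the collected feedback is what a refinement step would pass back to the summarizer.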

Paper Structure (43 sections, 5 equations, 10 figures, 13 tables)

Figures (10)

  • Figure 1: Comparison of factuality evaluation by an LLM Judge and our Agent Judge with NarrativeFactScore. Given scenes from The Lord of the Rings, the summary incorrectly claims "Sauron is pursuing Gandalf." The LLM Judge assigns a 100% factuality score, while our Agent Judge correctly identifies this error by analyzing atomic facts about characters, assigning a NarrativeFactScore of 75% with specific feedback.
  • Figure 2: Overview of the evaluation and refinement process, which has three main stages. First, the CKG $G$ is extracted from the narrative $\mathcal{N}$. Next, factuality is calculated by comparing each atomic fact $a_k$ of the decomposed summary against the retrieved character relationship subgraph $g$ and narrative scene $\mathcal{S}_i$. Finally, in the agent-based refinement process, feedback items ($f_1, f_2, \ldots$) are used to improve the factual accuracy of the summary.
  • Figure 3: (a) Part of a knowledge graph generated from The Lord of the Rings, with three named entities. 'Frodo/Frodo Baggins' is a single entity with two names. (b) The same graph in linearized form.
  • Figure 4: Deployment overview of NarrativeFactScore.
  • Figure 5: Simplified prompt for named entity recognition and knowledge graph edges generation.
  • ...and 5 more figures
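
The Figure 3 caption describes a character graph in which one entity can carry several names and which can be written out in a linearized form. A minimal Python sketch of that idea follows. The class name, the `head, relation, tail` line format, and the example relations are assumptions made for illustration; they are not taken from the figure or from the authors' code.

```python
from collections import defaultdict


class CharacterGraph:
    """Hypothetical character knowledge graph with alias merging."""

    def __init__(self):
        self._canonical = {}                # any known name -> canonical name
        self._aliases = defaultdict(list)   # canonical name -> all its names
        self._edges = []                    # (head, relation, tail) triples

    def add_entity(self, *names):
        """Register one entity; the first name is used as the canonical one."""
        canonical = names[0]
        for name in names:
            self._canonical[name] = canonical
            if name not in self._aliases[canonical]:
                self._aliases[canonical].append(name)

    def add_edge(self, head, relation, tail):
        """Store an edge, resolving any alias to its canonical entity."""
        self._edges.append(
            (self._canonical[head], relation, self._canonical[tail])
        )

    def label(self, canonical):
        """Join all names of an entity, e.g. 'Frodo/Frodo Baggins'."""
        return "/".join(self._aliases[canonical])

    def linearize(self):
        """Render each edge as one text line (an assumed format)."""
        return [
            f"{self.label(h)}, {r}, {self.label(t)}"
            for h, r, t in self._edges
        ]


graph = CharacterGraph()
graph.add_entity("Frodo", "Frodo Baggins")   # one entity, two names
graph.add_entity("Gandalf")
graph.add_entity("Sauron")
# Relations below are illustrative placeholders, not read from Figure 3.
graph.add_edge("Frodo Baggins", "friend of", "Gandalf")
graph.add_edge("Gandalf", "opposes", "Sauron")

lines = graph.linearize()
```

Resolving aliases before storing an edge keeps "Frodo" and "Frodo Baggins" attached to a single node, so facts stated under either name end up on the same entity in the linearized text.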