Learning to Verify Summary Facts with Fine-Grained LLM Feedback

Jihwan Oh, Jeonghwan Choi, Nicole Hee-Yeon Kim, Taewon Yun, Hwanjun Song

TL;DR

The paper addresses the high cost and limited scalability of human-labeled fact verification data for summaries by introducing FineSumFact, a large-scale dataset of fine-grained LLM-generated feedback. It presents a four-stage pipeline that generates diverse summaries from 10 LLMs, collects sentence-level feedback via an off-the-shelf verifier (FineSurE) powered by Llama-3-70B-Instruct, and trains a lightweight verifier (Llama-3-8B-Instruct) through sequence-level distillation, fine-tuned with QLoRA. Empirical results show that training on extensive LLM feedback yields higher agreement with human judgments than training on human-labeled data alone and outperforms QA/NLI-based evaluators, with added benefits from explainable feedback such as reasoning and error localization. The approach is also more efficient: the lightweight verifier reaches near-teacher performance with significantly faster inference and lower API costs, suggesting a scalable path for domain-general fact verification of summaries.

Abstract

Training automatic summary fact verifiers often faces the challenge of a lack of human-labeled data. In this paper, we explore an alternative way of leveraging Large Language Model (LLM)-generated feedback to address the inherent limitation of using human-labeled data. We introduce FineSumFact, a large-scale dataset containing fine-grained factual feedback on summaries. We employ 10 distinct LLMs for diverse summary generation and Llama-3-70B-Instruct for feedback. We utilize this dataset to fine-tune the lightweight open-source model Llama-3-8B-Instruct, optimizing resource efficiency while maintaining high performance. Our experimental results reveal that the model trained on extensive LLM-generated datasets surpasses the model trained on smaller human-annotated datasets when evaluated using human-generated test sets. Fine-tuning fact verification models with LLM feedback can be more effective and cost-efficient than using human feedback. The dataset is available at https://github.com/DISL-Lab/FineSumFact.
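To make the distillation step concrete, the following is a minimal sketch of how teacher feedback on a summary could be turned into a training pair for the lightweight verifier. It covers data preparation only; the fine-tuning itself is not shown. The class `SentenceFeedback`, the functions `build_distillation_example` and `binary_labels`, the prompt wording, the JSON field names, and the category strings are all hypothetical and are not taken from the paper or from the FineSumFact repository.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SentenceFeedback:
    """Teacher feedback for one summary sentence (hypothetical schema)."""
    sentence: str
    reasoning: str
    category: str  # "no error" or an error type; these labels are illustrative


def build_distillation_example(document: str, feedback: List[SentenceFeedback]) -> Dict[str, str]:
    """Build one (prompt, target) pair for supervised fine-tuning.

    In sequence-level distillation the student is trained to reproduce the
    teacher's full output sequence, so the target is the serialized feedback.
    """
    numbered = "\n".join(f"{i + 1}. {fb.sentence}" for i, fb in enumerate(feedback))
    prompt = (
        "Check each summary sentence against the document.\n\n"
        f"Document:\n{document}\n\n"
        f"Summary sentences:\n{numbered}\n"
    )
    target = json.dumps(
        [
            {"sentence": fb.sentence, "reason": fb.reasoning, "category": fb.category}
            for fb in feedback
        ]
    )
    return {"prompt": prompt, "target": target}


def binary_labels(feedback: List[SentenceFeedback]) -> List[int]:
    """Collapse fine-grained feedback to sentence-level labels (1 = factual error)."""
    return [0 if fb.category == "no error" else 1 for fb in feedback]
```

Keeping the reasoning and error category in the target, rather than only a binary label, is what lets the student learn the explainable feedback described above; `binary_labels` shows how the same records reduce to the binary setting.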

Paper Structure

This paper contains 34 sections, 4 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Pipeline: our evaluator is trained with LLM feedback generated on diverse input texts and summaries and then tested on an unseen test set.
  • Figure 2: Prompt for fact verification ("Binary Label" in Table \ref{table:granularity-feedback}).
  • Figure 3: Prompt for fact verification ("Binary Label + Reasoning" in Table \ref{table:granularity-feedback}).
  • Figure 4: Prompt for fact verification ("Binary Label + Reasoning + Error Localization" in Table \ref{table:granularity-feedback}), which is identical to the FineSurE prompt \cite{song2024finesure}.
  • Figure 5: Error category distribution of summaries with LLM feedback for each summarizer, where the error category is estimated by automated fact verification.