Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models

Colleen Gilhuly, Haleh Shahzad

TL;DR

The paper evaluates a broad set of news-summarization approaches, from extractive TextRank to large instruction-tuned LLMs, using both standard metrics and novel LLM-powered evaluations of factual consistency. It introduces meta-evaluation to gauge the reliability of the evaluation prompts and models themselves. Across the XL-Sum English subset, many models achieve high consistency with the source text, while XL-Sum references often contain noninferable information, highlighting evaluation challenges. QA and fact-checking based evaluations offer complementary insights, with fact-checking generally delivering more stable alignment with factuality, though neither approach is perfect. The study underscores the need for robust meta-evaluations and careful consideration of dataset quality when benchmarking modern summarization models.

Abstract

Text summarization is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Large Language Models (LLMs) have shown remarkable promise in generating fluent abstractive summaries, but they can produce hallucinated details not grounded in the source text. Regardless of the method of generating a summary, high-quality automated evaluations remain an open area of investigation. This paper explores text summarization with a diverse set of techniques, including TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo. The generated summaries are evaluated using traditional metrics such as the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score and Bidirectional Encoder Representations from Transformers (BERT) Score, as well as LLM-powered evaluation methods that directly assess a generated summary's consistency with the source text. We introduce a meta-evaluation score which directly assesses the performance of the LLM evaluation system (prompt + model). We find that all summarization models produce consistent summaries when tested on the XL-Sum dataset, exceeding the consistency of the reference summaries.

Paper Structure

This paper contains 31 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the LLM-powered QA evaluation.
  • Figure 2: Overview of the LLM-powered fact-checking evaluation.
  • Figure 3: Average ROUGE scores for each model. The blue bars show ROUGE-1 and the orange bars show ROUGE-L. The small black error bars depict the approximate 95% confidence interval of the averages. The left panel depicts the traditional ROUGE evaluation against reference summaries while the right panel depicts a modified ROUGE evaluation against the source article. The XL-Sum reference summaries are included in the modified evaluation on the right, depicted in grey.
  • Figure 4: Average BERTScore for each model. The small black error bars depict the approximate 95% confidence interval of the averages. The left panel depicts the traditional BERTScore evaluation against reference summaries while the right panel depicts a modified BERTScore evaluation against the source article. The XL-Sum reference summaries are included in the modified evaluation on the right, depicted in grey.
  • Figure 5: Average consistency, hallucination, and meta evaluation scores for the QA evaluation when applied to model generated summaries (blue bars) and the XL-Sum reference summaries (grey bar). The small black error bars depict the approximate 95% confidence interval of the averages.
  • ...and 1 more figure
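
The figure captions above refer to an "approximate 95% confidence interval of the averages". The listing does not say how that interval is computed, so the following is only a minimal sketch under the assumption that it is the usual normal approximation (mean plus or minus 1.96 standard errors). The function name `mean_with_ci95` and the example scores are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, stdev


def mean_with_ci95(scores):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Assumption: normal approximation, i.e. mean +/- 1.96 * standard error.
    The paper may use a different procedure.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the spread")
    avg = mean(scores)
    # Standard error of the mean, from the sample standard deviation.
    sem = stdev(scores) / math.sqrt(n)
    return avg, 1.96 * sem


# Hypothetical per-summary scores for one model (illustrative values only).
example_scores = [0.8, 0.9, 1.0, 0.7, 0.6]
avg, half_width = mean_with_ci95(example_scores)
print(f"average = {avg:.3f}, 95% CI half-width = {half_width:.3f}")
```

Under this assumption, each error bar would span from `avg - half_width` to `avg + half_width`. With few samples the normal approximation becomes loose, and a t-based or bootstrap interval would be a reasonable alternative.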