Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains

Sanjana Ramprasad, Kundan Krishna, Zachary C Lipton, Byron C Wallace

TL;DR

This work extends zero-shot summarization evaluation beyond news to niche domains (medicine and law), focusing on factuality. By conducting two rounds of expert annotations on GPT-3.5-turbo-0301 and Flan-T5-XL outputs across News, PubMed, and Billsum, the authors quantify error prevalence, categorize error types, and examine the relation between extractiveness and factuality. They find that news summaries exhibit more extrinsic errors than specialized domains, and that domain pretraining exposure may influence factuality through increased memorization or content reproduction. Automated factuality metrics underperform in non-news domains, which highlights the need for domain-aware evaluation. The authors release the annotated dataset to support further research on factually accurate summarization.

Abstract

Recent work has shown that large language models (LLMs) are capable of generating summaries zero-shot (i.e., without explicit supervision) that, under human assessment, are often comparable or even preferred to manually composed reference summaries. However, this prior work has focused almost exclusively on evaluating news article summarization. How do zero-shot summarizers perform in other (potentially more specialized) domains? In this work we evaluate zero-shot generated summaries across specialized domains including biomedical articles and legal bills (in addition to standard news benchmarks for reference). We focus especially on the factuality of outputs. We acquire annotations from domain experts to identify inconsistencies in summaries and systematically categorize these errors. We analyze whether the prevalence of a given domain in the pretraining corpus affects extractiveness and faithfulness of generated summaries of articles in this domain. We release all collected annotations to facilitate additional research toward measuring and realizing factually accurate summarization, beyond news articles. The dataset can be downloaded from https://github.com/sanjanaramprasad/zero_shot_faceval_domains

Paper Structure (20 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Distribution of errors and error categories across domains
  • Figure 2: Proportion of 3-gram overlaps between model generated summaries and articles. We observe the most copying in the case of PubMed (especially under Flan-T5). This likely explains the greater factuality observed in this domain, and may reflect unfamiliarity with the domain (see Figure 3).
  • Figure 3: ROUGE-L recall scores of original articles in comparison with LLM-generated documents to measure domain exposure during pretraining. Models show higher familiarity with news topics, which may lead to the inclusion of unsupported content in summaries.
  • Figure 4: Annotation interface with questions asked and example annotation on both rounds of annotations
  • Figure 5: Examples of sentences annotated with different categories in the news dataset by annotators along with comments provided.
  • ...and 2 more figures
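
The two quantities named in the captions for Figures 2 and 3 (3-gram overlap between a summary and its article, and ROUGE-L recall of an original article against an LLM-generated document) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the lowercased whitespace tokenization, and the choice to count overlap over summary n-gram occurrences are all assumptions made for illustration; the paper's exact preprocessing is not given here.

```python
def ngrams(tokens, n=3):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(summary, article, n=3):
    """Fraction of summary n-grams that also occur in the article.

    Assumption: lowercased whitespace tokens; a higher value means
    the summary copies more text from the article (more extractive).
    """
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    article_ngrams = set(ngrams(article.lower().split(), n))
    copied = sum(1 for gram in summary_ngrams if gram in article_ngrams)
    return copied / len(summary_ngrams)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by the reference length.

    Assumption: `reference` is the original article and `candidate`
    is the LLM-generated document, following the Figure 3 caption.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, cand_tokens) / len(ref_tokens)
```

For example, with the article "the bill amends the tax code to extend credits", the summary "the bill amends the tax code" shares all four of its 3-grams with the article, while "the bill repeals the tax code" shares only one of four.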