APPLS: Evaluating Evaluation Metrics for Plain Language Summarization

Yue Guo; Tal August; Gondy Leroy; Trevor Cohen; Lucy Lu Wang

TL;DR

This paper tackles the challenge of evaluating plain language summarization (PLS) by proposing APPLS, a granular meta-evaluation testbed. APPLS applies 11 perturbations across four criteria (informativeness, simplification, coherence, and faithfulness) to two diagnostic datasets, CELLS and PLABA, to probe metric sensitivity. The authors evaluate 14 metrics spanning automated scores, lexical features, and LLM prompt-based evaluations. They find that no single metric covers all four criteria and recommend using a suite of metrics instead. Among existing automated metrics, SARI emerges as uniquely sensitive to simplification, while prompt-based evaluations are sensitive to informativeness, faithfulness, and simplification but less so to coherence. The authors provide guidance for robust automated evaluation of PLS and demonstrate extensibility to other domains, with code available at https://github.com/LinguisticAnomalies/APPLS.

Abstract

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work -- informativeness, simplification, coherence, and faithfulness -- and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to extractive hypotheses for two PLS datasets to form our testbed. Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics. APPLS and our evaluation code are available at https://github.com/LinguisticAnomalies/APPLS.
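
The core idea of the testbed, perturbing a hypothesis by increasing amounts and checking whether a metric's score moves accordingly, can be illustrated with a minimal sketch. This is not the APPLS implementation. Sentence deletion is assumed here as an illustrative informativeness-style perturbation, a toy unigram-overlap F1 stands in for a real metric such as ROUGE, and the function names (`delete_sentences`, `unigram_f1`, `sensitivity_curve`) are hypothetical.

```python
import random
from collections import Counter


def delete_sentences(sentences, fraction, rng):
    """Assumed perturbation: drop a given fraction of sentences at random."""
    n_drop = round(len(sentences) * fraction)
    drop = set(rng.sample(range(len(sentences)), n_drop))
    return [s for i, s in enumerate(sentences) if i not in drop]


def unigram_f1(hypothesis, reference):
    """Toy stand-in metric: unigram-overlap F1 (not ROUGE itself)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def sensitivity_curve(sentences, reference, fractions, seed=0):
    """Score the perturbed hypothesis at each perturbation fraction."""
    rng = random.Random(seed)
    curve = []
    for fraction in fractions:
        perturbed = " ".join(delete_sentences(sentences, fraction, rng))
        curve.append((fraction, unigram_f1(perturbed, reference)))
    return curve


# Illustrative data only; not drawn from CELLS or PLABA.
example_sentences = [
    "The drug lowered blood pressure.",
    "Side effects were mild.",
    "The trial enrolled adults only.",
    "More studies are needed.",
]
example_reference = " ".join(example_sentences)

for fraction, score in sensitivity_curve(
    example_sentences, example_reference, [0.0, 0.25, 0.5, 0.75, 1.0]
):
    print(f"{fraction:.2f} perturbed -> score {score:.3f}")
```

A metric that is sensitive to this perturbation should show a score that falls as the deleted fraction grows; a flat curve would suggest the metric does not capture that criterion.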

Paper Structure (27 sections, 16 figures, 3 tables)

Introduction
Related Work
Limitations of Existing Metrics
Robust Analysis with Synthetic Data
PLS Evaluation Criteria
Constructing the APPLS Testbed
Diagnostic datasets
Applying perturbations to datasets
Informativeness
Simplification
Coherence
Faithfulness
Human validation of candidate summaries and LLM simplifications
Evaluation Metrics
Existing automated evaluation metrics
...and 12 more sections

Figures (16)

Figure 1: We present APPLS, the first granular testbed for analyzing evaluation metric performance for plain language summarization (PLS). We assess performance of 14 existing metrics, including automated scores, lexical features, and LLM prompt-based evaluations.
Figure 2: Average scores of existing metrics for perturbed texts in the CELLS dataset. Scores are averaged in 10 bins by perturbation percentage. Markers denote the defined criteria associated with that perturbation. Median reported improvements in ACL'22 summarization and generation papers are ROUGE (+0.89), BLEU (+0.69), METEOR (+0.50), SARI (+1.71), BERTScore (+0.55), and PPL (-2.06).
Figure 3: Relative change of each lexical feature with respect to the unperturbed state (0%). Different markers represent lexical feature categories.
Figure 4: Prompt-based evaluation scores for four criteria (informativeness, simplification, coherence, and faithfulness) along with an overall score, under three prompt settings: (a) providing a single criterion in the prompt and requesting a score for that criterion; (b) providing all criteria in the prompt and requesting scores for each criterion as well as an overall score; and (c) the same setting as (b) but with an additional requirement for explanations of the provided scores. Notably, prompt-based scores demonstrate sensitivity to perturbations in informativeness, faithfulness, and simplification, while showing less sensitivity to changes in coherence. The three prompt settings yield similar results, except that providing all criteria (settings b and c) is more sensitive to entity swaps than providing a single criterion (setting a).
Figure 5: Variation in existing scores for simplification perturbations created by GPT-3, Llama2, and Claude on the CELLS dataset.
...and 11 more figures
