FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

Sebastian Antony Joseph, Lily Chen, Jan Trienes, Hannah Louisa Göke, Monika Coers, Wei Xu, Byron C Wallace, Junyi Jessy Li

TL;DR

FactPICO introduces a domain-specific factuality benchmark for plain-language medical summaries of randomized controlled trials. It collects 345 plain-language outputs for 115 RCT abstracts from three LLMs (GPT-4, Llama-2, Alpaca), annotated by experts for PICO elements, evidence inferences, and added information, with expert rationales. The study benchmarks existing factuality metrics and novel LLM-based evaluators, finding that system-level correlations are stronger than instance-level ones and that decomposing tasks via a PICO-R extraction pipeline improves evaluation. Results reveal persistent challenges in balancing readability and factuality in medical text, with notable overgeneralization and non-factual elaborations, underscoring the need for domain-aware metrics and explainable evaluation approaches. FactPICO thus provides a rigorous resource for developing and benchmarking factuality assessments in high-stakes plain-language medical summarization.

Abstract

Plain language summarization with LLMs can be useful for improving the textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated by three LLMs (GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added by LLMs. Using FactPICO, we benchmark a range of existing factuality metrics, including newly devised ones based on LLMs. We find that plain language summarization of medical evidence is still challenging, especially when balancing simplicity and factuality, and that existing metrics correlate poorly with expert judgments at the instance level.
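To make the distinction between instance-level and system-level agreement concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes Pearson correlation as the agreement measure (the paper may use a different coefficient), and the system names and scores in `toy_scores` are hypothetical values invented for illustration, not results from FactPICO. Instance-level correlation pools every (metric score, expert score) pair; system-level correlation first averages the scores per system and then correlates those averages.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the paper): for each system, a list of
# (automatic_metric_score, expert_score) pairs, one pair per summary.
toy_scores = {
    "system_a": [(0.9, 4), (0.7, 5), (0.8, 3)],
    "system_b": [(0.6, 3), (0.4, 2), (0.5, 4)],
    "system_c": [(0.3, 1), (0.1, 2), (0.2, 1)],
}


def instance_level(scores):
    """Correlate metric and expert scores over all individual summaries."""
    pairs = [pair for system_pairs in scores.values() for pair in system_pairs]
    metric, expert = zip(*pairs)
    return pearson(metric, expert)


def system_level(scores):
    """Correlate per-system mean metric scores with per-system mean expert scores."""
    metric_means = [mean(m for m, _ in pairs) for pairs in scores.values()]
    expert_means = [mean(e for _, e in pairs) for pairs in scores.values()]
    return pearson(metric_means, expert_means)


if __name__ == "__main__":
    print("instance-level:", round(instance_level(toy_scores), 3))
    print("system-level:  ", round(system_level(toy_scores), 3))
```

In this toy setup the metric ranks the three systems in the same order as the experts, yet within each system it orders individual summaries less reliably, so the system-level value comes out higher than the instance-level one. That is the qualitative pattern the abstract describes, shown here only on made-up numbers.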

Paper Structure (59 sections, 7 figures, 16 tables)

Figures (7)

  • Figure 1: Expert evaluation of a GPT-4 plain language summary in FactPICO. The original abstract is omitted from this figure for space (it can be found in Appendix \ref{sec:abs_full_txt}). More examples are in Appendix \ref{app:factpicoexamples}.
  • Figure 2: QuestEval (left), GPT-4 eval (mid), and Extract (right) against Avg. PICO-R (x-axis) for plain language summaries generated by GPT-4 (red), Llama-2 (blue), and Alpaca (orange). Label distributions shown on the sides.
  • Figure 3: Plots of estimated Gaussian probability density functions from the standardized distributions of evaluated metrics.
  • Figure 4: All traditional factuality metrics and LLMs (no Extract) plotted against avg. PICO-R. Note that human and LLM scores are flipped (as $5 - \text{original}$) to be consistent with the metrics in Section \ref{sec:factualitymodels}, so higher is better.
  • Figure 5: The initial state of the Thresh interface (top) and the state after annotations have been completed (bottom).
  • ...and 2 more figures