PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization

Zhiwen You, Yue Guo

TL;DR

This work tackles factuality errors in biomedical plain-language summaries, especially for elaborative explanations that add external content. It introduces PlainFact, a sentence-level expert-annotated benchmark, and PlainQAFact, a retrieval-augmented QA metric that first classifies sentence factuality type and then verifies elaborations with domain knowledge. Across CELLS, PlainFact, and FareBio, PlainQAFact demonstrates superior alignment with human judgments on elaborative content, outperforming many existing metrics and matching or approaching GPT-4o in robustness. The approach provides a transparent, open-source evaluation tool that helps ensure reliable and safe plain-language medical communication, with notable improvements from selective retrieval and explainable QA steps.

Abstract

Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment-based and question-answering (QA)-based approaches, struggle with plain language summarization (PLS) due to the elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on a fine-grained, human-annotated dataset, PlainFact, for evaluating the factual consistency of both source-simplified and elaboratively explained sentences. PlainQAFact first classifies sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact's effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents a sentence-aware, retrieval-augmented metric targeted at elaborative explanations in biomedical PLS tasks, providing the community with both a robust benchmark and a practical tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact
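
The two-stage flow described in the abstract (classify the sentence type, then score it with retrieval-augmented QA) can be illustrated with a minimal Python sketch. This is not the released PlainQAFact code. The function names (`score_sentence`, `token_f1`), the label strings, the choice of token-level F1 as the answer overlap measure, and the decision to append retrieved passages to the abstract are all assumptions made for illustration. The classifier, retriever, question generator, and QA model are passed in as plain callables so the sketch stays self-contained.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(predicted: str, gold: str) -> float:
    """Token-level F1 between two answers (an assumed overlap measure)."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts tokens shared by both answers.
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_sentence(
    sentence: str,
    abstract: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    make_qa_pairs: Callable[[str], List[Tuple[str, str]]],
    answer: Callable[[str, str], str],
) -> float:
    """Hypothetical scoring routine; all callables are stand-ins for real models."""
    # Stage 1: decide whether the sentence simplifies the source or elaborates on it.
    sentence_type = classify(sentence)

    # Stage 2: choose the evidence. Only elaborations trigger retrieval
    # (assumption: retrieved passages are appended to the abstract).
    evidence = abstract
    if sentence_type == "elaboration":
        evidence = " ".join([abstract, *retrieve(sentence)])

    # Stage 3: questions and gold answers come from the summary sentence;
    # the QA model answers each question from the evidence only.
    qa_pairs = make_qa_pairs(sentence)
    if not qa_pairs:
        return 0.0
    scores = [token_f1(answer(q, evidence), gold) for q, gold in qa_pairs]
    return sum(scores) / len(scores)


# Toy stand-ins, loosely following the "bronchodilator" example in Figure 1.
def demo_classify(sentence: str) -> str:
    return "elaboration" if "bronchodilator" in sentence else "simplification"


def demo_retrieve(sentence: str) -> List[str]:
    return ["A bronchodilator is a medicine that opens the airways."]


def demo_make_qa_pairs(sentence: str) -> List[Tuple[str, str]]:
    return [("What does a bronchodilator do?", "opens the airways")]


def demo_answer(question: str, evidence: str) -> str:
    return "opens the airways" if "opens the airways" in evidence else ""


ABSTRACT = "Salbutamol improved lung function in adults with asthma."
SENTENCE = "A bronchodilator opens the airways."

with_retrieval = score_sentence(
    SENTENCE, ABSTRACT, demo_classify, demo_retrieve, demo_make_qa_pairs, demo_answer
)
without_retrieval = score_sentence(
    SENTENCE, ABSTRACT, lambda s: "simplification", demo_retrieve,
    demo_make_qa_pairs, demo_answer
)
```

In this toy setup the elaborative sentence can only be verified when retrieval is triggered, which is the motivation the abstract gives for classifying sentence type before scoring.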

Paper Structure

This paper contains 44 sections, 1 equation, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of PlainQAFact. A fine-tuned classifier first identifies the sentence type, involving either source simplification or elaborative explanation. Then, a QA-based evaluation pipeline performs answer extraction, question generation, question answering, and answer overlap evaluation. For elaborative content not present in the scientific abstract, PlainQAFact retrieves external knowledge to verify factual consistency. The illustrated example shows an elaborative explanation involving a "bronchodilator" not mentioned in the source abstract but verifiable through external evidence. PlainQAFact assigns a high score, reflecting strong alignment between the extracted and gold answers.
  • Figure 2: Overall performance on human-annotated elaborative explanation summaries from PlainFact (392 summaries). The standard deviations of PlainQAFact, Llama 3.1, and GPT-4o are 0.1, 1.0, and 7.7, respectively, based on five runs for each metric. * indicates a statistically significant difference compared to PlainQAFact ($p < 0.01$). PlainQAFact significantly outperforms most of the automatic factual consistency evaluation metrics in AUC-ROC. Note that the CELLS dataset does not contain annotations for elaborative explanations. Results of explanation-only evaluation on FactPICO and FareBio are reported in the appendix.
  • Figure 3: Percentage change in score from the baseline for five metrics on the FactPICO dataset when factual added information is removed. We expect each metric to stay unchanged even as more factual added information is removed. The evaluation dataset contains 88 valid summary-abstract pairs.
  • Figure 4: Percentage change in score from the baseline for five metrics on the FactPICO dataset when non-factual added information is removed (60 pairs). We expect the percentage change from the baseline to increase as more non-factual added information is removed.
  • Figure 5: Overall performance on summaries containing added information (i.e., elaborative explanations) from FactPICO (Joseph et al., 2024). The standard deviations of PlainQAFact, Llama 3.1, and GPT-4o are 0.2, 0.2, and 3.3, respectively, based on five runs of each metric.
  • ...and 1 more figure
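
The "score change percentage from baselines" reported in Figures 3 and 4 can be read as a relative change with respect to the score on the unmodified summary. The captions do not spell out the formula, so the short Python sketch below is an assumption about how such a quantity is typically computed; the function name `score_change_pct` and the example scores are hypothetical and are not taken from the paper.

```python
def score_change_pct(baseline: float, score: float) -> float:
    """Relative change of `score` with respect to `baseline`, in percent.

    Assumed definition: (score - baseline) / |baseline| * 100.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (score - baseline) / abs(baseline) * 100.0


# Hypothetical metric scores, for illustration only.
baseline_score = 0.50

# Removing factual added information: a well-behaved metric should barely move.
factual_removed = score_change_pct(baseline_score, 0.50)

# Removing non-factual added information: a well-behaved metric should go up.
non_factual_removed = score_change_pct(baseline_score, 0.55)
```

Under this reading, the expectation stated for Figure 3 corresponds to values near zero, and the expectation stated for Figure 4 corresponds to values that grow as more non-factual content is removed.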