PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization

Zhiwen You; Yue Guo

TL;DR

This work tackles factuality errors in biomedical plain-language summaries, especially for elaborative explanations that add external content. It introduces PlainFact, a sentence-level expert-annotated benchmark, and PlainQAFact, a retrieval-augmented QA metric that first classifies sentence factuality type and then verifies elaborations with domain knowledge. Across CELLS, PlainFact, and FareBio, PlainQAFact demonstrates superior alignment with human judgments on elaborative content, outperforming many existing metrics and matching or approaching GPT-4o in robustness. The approach provides a transparent, open-source evaluation tool that helps ensure reliable and safe plain-language medical communication, with notable improvements from selective retrieval and explainable QA steps.

Abstract

Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment-based and question-answering (QA)-based approaches, struggle with plain language summarization (PLS) because of the elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on PlainFact, a fine-grained, human-annotated dataset, for evaluating the factual consistency of both source-simplified and elaboratively explained sentences. PlainQAFact first classifies the sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact's effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents a sentence-aware, retrieval-augmented metric targeted at elaborative explanations in biomedical PLS tasks, providing the community with both a robust benchmark and a practical tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact
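
The two-stage flow described in the abstract (classify the sentence type, then score with retrieval-augmented QA) can be illustrated with a minimal sketch. This is not the released PlainQAFact implementation: the function names `token_f1` and `score_sentence`, the label string `"elaborative"`, the callable parameters, and the choice of token-level F1 as the answer overlap measure are all assumptions made for illustration. It also assumes that elaborative sentences are checked against the abstract plus retrieved passages, while simplification sentences are checked against the abstract alone.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two answer strings (one possible overlap measure)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_sentence(
    sentence: str,
    abstract: str,
    classify: Callable[[str], str],                        # hypothetical sentence-type classifier
    retrieve: Callable[[str], List[str]],                  # hypothetical external-knowledge retriever
    make_qa_pairs: Callable[[str], List[Tuple[str, str]]], # hypothetical (question, answer) generator
    answer: Callable[[str, str], str],                     # hypothetical QA model: (question, context) -> answer
) -> float:
    """Score one summary sentence in [0, 1] by answering its questions from evidence."""
    context = abstract
    if classify(sentence) == "elaborative":
        # Assumption: elaborations are verified against abstract + retrieved passages.
        context = abstract + "\n" + "\n".join(retrieve(sentence))
    qa_pairs = make_qa_pairs(sentence)
    if not qa_pairs:
        return 0.0
    scores = [token_f1(answer(question, context), expected) for question, expected in qa_pairs]
    return sum(scores) / len(scores)
```

Passing the classifier, retriever, and QA components in as callables keeps the sketch independent of any particular model; in practice each would be backed by a trained model or a retrieval index.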
