PERCS: Persona-Guided Controllable Biomedical Summarization Dataset
Rohan Charudatt Salvi, Chirag Chawla, Dhruv Jain, Swapnil Panigrahi, Md Shad Akhtar, Shweta Yadav
TL;DR
PERCS addresses the mismatch between biomedical communication and diverse reader needs by providing four persona-targeted summaries per abstract. Initial summaries were generated by a model and then validated by experts using a detailed error taxonomy, with high inter-annotator agreement. The work benchmarks four LLMs under zero-shot, few-shot, and self-refine prompting, using metrics for comprehensiveness, readability, and faithfulness (ROUGE, SARI, FKGL, DCRS, CLI, LENS, SummaC). The resource and guidelines are publicly available, enabling future research on persona-aware, controllable biomedical summarization and potential improvements in health literacy.
Abstract
Automatic medical text simplification plays a key role in improving health literacy by making complex biomedical research accessible to diverse readers. However, most existing resources assume a single generic audience, overlooking the wide variation in medical literacy and information needs across user groups. To address this limitation, we introduce PERCS (Persona-guided Controllable Summarization), a dataset of biomedical abstracts paired with summaries tailored to four personas: Laypersons, Premedical Students, Non-medical Researchers, and Medical Experts. These personas represent different levels of medical literacy and information needs, emphasizing the need for targeted, audience-specific summarization. Each summary in PERCS was reviewed by physicians for factual accuracy and persona alignment using a detailed error taxonomy. Technical validation shows clear differences in readability, vocabulary, and content depth across personas. Along with describing the dataset, we benchmark four large language models on PERCS using automatic evaluation metrics that assess comprehensiveness, readability, and faithfulness, establishing baseline results for future research. The dataset, annotation guidelines, and evaluation materials are publicly available to support research on persona-specific communication and controllable biomedical summarization.
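To make the readability comparison across personas concrete, the following is a minimal sketch of one of the listed readability metrics, the Coleman-Liau Index (CLI), computed as 0.0588 L - 0.296 S - 15.8, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words. This is not the authors' evaluation code: the function name, the regex-based tokenization, and the two example texts are assumptions made for illustration, and established readability tools may split words and sentences differently.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Approximate Coleman-Liau Index using naive regex tokenization.

    Assumption: words are runs of letters (optionally joined by hyphens or
    apostrophes) and sentences end at '.', '!' or '?'. Abbreviations and
    decimals will therefore be mis-split.
    """
    words = re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    letters = sum(ch.isalpha() for w in words for ch in w)
    letters_per_100_words = letters / len(words) * 100
    sentences_per_100_words = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


# Hypothetical example texts (not taken from PERCS).
layperson_text = "The drug helps the heart pump blood. It is safe for most people."
expert_text = (
    "Administration of the angiotensin receptor-neprilysin inhibitor "
    "significantly improved left ventricular ejection fraction."
)

if __name__ == "__main__":
    print(f"layperson-style CLI: {coleman_liau_index(layperson_text):.2f}")
    print(f"expert-style CLI:    {coleman_liau_index(expert_text):.2f}")
```

A lower score indicates text that is easier to read, so a layperson-oriented summary would be expected to score lower than an expert-oriented one for the same abstract.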
