PERCS: Persona-Guided Controllable Biomedical Summarization Dataset

Rohan Charudatt Salvi, Chirag Chawla, Dhruv Jain, Swapnil Panigrahi, Md Shad Akhtar, Shweta Yadav

TL;DR

PERCS addresses the mismatch between biomedical communication and diverse reader needs by providing four persona-targeted summaries per abstract. Initial summaries were model-generated and then validated by experts using a detailed error taxonomy, with high inter-annotator agreement. The paper benchmarks four LLMs with zero-shot, few-shot, and self-refine prompting, using metrics for comprehensiveness, readability, and faithfulness (ROUGE, SARI, FKGL, DCRS, CLI, LENS, SummaC). The dataset and guidelines are publicly available, enabling future research on persona-aware, controllable biomedical summarization and potential improvements in health literacy.

Abstract

Automatic medical text simplification plays a key role in improving health literacy by making complex biomedical research accessible to diverse readers. However, most existing resources assume a single generic audience, overlooking the wide variation in medical literacy and information needs across user groups. To address this limitation, we introduce PERCS (Persona-guided Controllable Summarization), a dataset of biomedical abstracts paired with summaries tailored to four personas: Laypersons, Premedical Students, Non-medical Researchers, and Medical Experts. These personas represent different levels of medical literacy and information needs, emphasizing the need for targeted, audience-specific summarization. Each summary in PERCS was reviewed by physicians for factual accuracy and persona alignment using a detailed error taxonomy. Technical validation shows clear differences in readability, vocabulary, and content depth across personas. Along with describing the dataset, we benchmark four large language models on PERCS using automatic evaluation metrics that assess comprehensiveness, readability, and faithfulness, establishing baseline results for future research. The dataset, annotation guidelines, and evaluation materials are publicly available to support research on persona-specific communication and controllable biomedical summarization.
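
Two of the readability metrics used in the benchmark, FKGL (Flesch-Kincaid Grade Level) and CLI (Coleman-Liau Index), are closed-form formulas over word, sentence, syllable, and letter counts. The following is a minimal sketch of how such scores can be computed, not the evaluation code used by the authors. The tokenization rules, the syllable heuristic, and the function names `count_syllables`, `fkgl`, and `cli` are assumptions made for illustration; a dedicated readability library will give slightly different values.

```python
import re


def count_syllables(word: str) -> int:
    """Heuristic (assumption): count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def _split(text: str):
    """Naive tokenization (assumption): sentences end at . ! ?; words are letter runs."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    return sentences, words


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level: higher means harder to read."""
    sentences, words = _split(text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def cli(text: str) -> float:
    """Coleman-Liau Index, based on letters and sentences per 100 words."""
    sentences, words = _split(text)
    letters_per_100 = 100.0 * sum(len(w) for w in words) / len(words)
    sentences_per_100 = 100.0 * len(sentences) / len(words)
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the PERCS dataset.
    lay = "The drug helps the heart. It is safe."
    expert = ("Pharmacological intervention significantly attenuated "
              "myocardial hypertrophy in hypertensive patients.")
    for name, text in [("lay", lay), ("expert", expert)]:
        print(f"{name}: FKGL={fkgl(text):.2f} CLI={cli(text):.2f}")
```

Under these formulas, a summary written for a lay reader would be expected to score lower than one written for a medical expert. This matches the kind of cross-persona readability difference the abstract describes.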

Paper Structure

This paper contains 3 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Persona-specific summarization of a biomedical abstract in PERCS. Green represents simplification, and blue represents information detail.
  • Figure 2: Example of a prompt design used for Lay persona summary generation in the PERCS dataset.
  • Figure 3: An overview of the steps involved in constructing the PERCS dataset, namely data collection, summary generation, and expert validation.
  • Figure 4: Example of addressing logical errors and supplying persona-specific missing information to produce an appropriate summary for the persona.