A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives

Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He, Xiaolei Huang

Abstract

Synthetic data generation using large language models (LLMs) demonstrates substantial promise in addressing biomedical data challenges and shows increasing adoption in biomedical research. This study systematically reviews recent advances in synthetic data generation for biomedical applications and clinical research, focusing on how LLMs address data scarcity, utility, and quality issues with different modalities. We conducted a scoping review following PRISMA-ScR guidelines and searched literature published between 2020 and 2025 through PubMed, ACM, Web of Science, and Google Scholar. A total of 59 studies were included based on relevance to synthetic data generation in biomedical contexts. Among the reviewed studies, the predominant data modalities were unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%). Common generation methods included LLM prompting (74.6%), fine-tuning (20.3%), and specialized models (5.1%). Evaluations were heterogeneous: intrinsic metrics (27.1%), human-in-the-loop assessments (44.1%), and LLM-based evaluations (13.6%). However, limitations and key barriers persist in data modalities, domain utility, resource and model accessibility, and standardized evaluation protocols. Future efforts may focus on developing standardized, transparent evaluation frameworks and expanding accessibility to support effective applications in biomedical research.
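As a rough reading aid, the reported percentages can be converted back into approximate study counts out of the 59 included studies. The sketch below is an illustration only: the dictionary layout and the helper name `approximate_counts` are assumptions made here, and the counts are estimates obtained by rounding, not figures stated in the abstract.

```python
# Approximate study counts implied by the percentages in the abstract.
# Assumption: each percentage is a share of the 59 included studies.
TOTAL_STUDIES = 59

reported_shares = {
    "modality": {"unstructured text": 78.0, "tabular": 13.6, "multimodal": 8.4},
    "generation method": {"prompting": 74.6, "fine-tuning": 20.3, "specialized": 5.1},
    "evaluation": {"intrinsic": 27.1, "human-in-the-loop": 44.1, "LLM-based": 13.6},
}


def approximate_counts(shares, total=TOTAL_STUDIES):
    """Round each percentage share to the nearest whole number of studies."""
    return {label: round(pct / 100 * total) for label, pct in shares.items()}


if __name__ == "__main__":
    for group, shares in reported_shares.items():
        counts = approximate_counts(shares)
        print(f"{group}: {counts} (sum of shares: {sum(shares.values()):.1f}%)")
```

The modality and generation-method shares each sum to 100%, so their rounded counts add up to 59. The three evaluation shares sum to 84.8%, so those counts should not be expected to cover all 59 studies.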

Paper Structure

This paper contains 13 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Publications between January 1, 2020 and April 5, 2025. Orange and blue denote peer-reviewed conference and journal publications, respectively. The number of studies on biomedical synthetic data has risen sharply over this period.
  • Figure 1: Summary of data types, generation methods, accessibility, purposes and medical applications in the collected studies. Languages include English (EN), Dutch (NL), French (FR), Chinese (ZH), and Arabic (AR). Generation methods comprise fine-tuning (SFT: supervised, DAFT: domain-adaptive, TAFT: task-adaptive, IT: instruction tuning, None) and prompting (ZS: zero-shot, FS: few-shot, INST: instruction-driven, CoT: reasoning-augmented, RAG: knowledge-augmented). Synthetic data purpose includes Training (used independently), Supplement (used with real data for joint training), and Privacy (used for privacy preservation such as sharing or de-identification). Data accessibility is classified as Yes (publicly available), Request (available upon request), and No (not publicly available or not specified). Medical applications cover question answering (QA), information extraction (IE), and social determinants of health (SDoH).
  • Figure 2: Overlap among prompting strategies (INST, Zero-shot, Few-shot, CoT, RAG) in the reviewed studies. Each intersection represents studies combining multiple methods.
  • Figure 3: Evaluation characteristics of reviewed studies: disease, downstream task, data size, number of automated metrics (Metrics #), inclusion of intrinsic evaluation (Intrinsic+), human involvement in generation or task evaluations (Human-in-the-Loop), and LLM-based evaluation (LLM Eval). CLS, NER, IE, RE, QA, NLI, and ASR refer to downstream tasks of classification, named entity recognition, information extraction, relation extraction, question answering, natural language inference, and automatic speech recognition, respectively.
  • Figure 4: Heatmap linking data modality to evaluation method, summarizing the number of reviewed studies across four data modalities (Text, Tabular, Image, Audio) and four evaluation types: intrinsic evaluation (Intrinsic+), extrinsic evaluation (Extrinsic+), human evaluation (Human Eval; corresponding to "Human-in-the-Loop" in the evaluation table), and LLM-based evaluation (LLM Eval).
  • ...and 1 more figures