Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes

Sunjun Kweon; Junu Kim; Jiyoun Kim; Sujeong Im; Eunbyeol Cho; Seongsu Bae; Jungwoo Oh; Gyubok Lee; Jong Hak Moon; Seng Chan You; Seungjin Baek; Chang Hoon Han; Yoon Bin Jung; Yohan Jo; Edward Choi

TL;DR

The paper tackles privacy barriers in deploying clinical NLP by building Asclepius, a multi-task clinical LLM trained entirely on synthetic clinical notes derived from public case reports. It introduces a data-generation pipeline that converts case reports into realistic discharge notes and corresponding instruction-answer pairs, validated by perplexity analyses and clinician-guided prompts. Asclepius (7B and 13B) is trained via domain-adaptive pretraining on synthetic notes and instruction fine-tuning, and is evaluated against GPT-3.5-turbo and open-source models using real discharge summaries, with Asclepius-R (real-note baseline) as a reference. Across preliminary, practical, and professional evaluations, Asclepius demonstrates competitive performance and, in some cases, parity with models trained on real data, supporting the viability of synthetic notes for sharing high-quality clinical LLMs. The work emphasizes open access to data, models, and prompts, enabling broader research and potential clinical AI deployment while acknowledging limitations such as note-type generalization and single-turn interactions, and highlights ongoing concerns about hallucinations and clinical safety.

Abstract

The development of large language models tailored for handling patients' clinical notes is often hindered by the limited accessibility and usability of these notes due to strict privacy regulations. To address these challenges, we first create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature. We then use these synthetic notes to train our specialized clinical large language model, Asclepius. While Asclepius is trained on synthetic data, we assess its potential performance in real-world applications by evaluating it using real clinical notes. We benchmark Asclepius against several other large language models, including GPT-3.5-turbo and other open-source alternatives. To further validate our approach using synthetic notes, we also compare Asclepius with its variants trained on real clinical notes. Our findings convincingly demonstrate that synthetic clinical notes can serve as viable substitutes for real ones when constructing high-performing clinical language models. This conclusion is supported by detailed evaluations conducted by both GPT-4 and medical professionals. All resources, including the weights, code, and data used in the development of Asclepius, are publicly accessible for future research (https://github.com/starmpcc/Asclepius).

Paper Structure (38 sections, 6 figures, 4 tables)

Introduction
Data Generation
Synthetic Clinical Notes
Clinical Instruction Generation
Clinical Large Language Model
Training
Evaluation
Comparative Analysis
Preliminary Evaluation
Practical Evaluation
Professional Evaluation
Related Work
Synthetic Clinical Notes
Language Models for Clinical NLP tasks
Conclusion
...and 23 more sections

Figures (6)

Figure 1: The large clinical language model, Asclepius, trained solely on synthetic clinical notes, can effectively handle various clinical NLP tasks on real notes in a zero-shot setting.
Figure 2: The first column shows part of a real discharge summary from MIMIC-III (Johnson et al., 2016). The second shows a case report from PMC-Patients (Zhao et al., 2023), and the third shows the synthetic discharge summary created from that case report. The case report initially does not resemble a real clinical note in format, but after the transformation it resembles one much more closely. The last column shows an instruction-answer pair generated from the synthetic clinical note. GPT-3.5-turbo was used in all generation processes.
Figure 3: The evaluation score from GPT-4 across diverse tasks and models. These tasks include: (A) MIMIC-III and MIMIC-IV (B) i2b2 and MTSamples (C) CASI (D) DiSCQ. The percentages listed beneath the GPT-4 scores represent the ratio of each model's score compared to the highest score achieved within that same model size category. The error bars represent a 95% confidence interval.
Figure 4: Professional and GPT-4 evaluation of Asclepius-13B and Asclepius-R-13B responses to 100 DiSCQ questions, featuring inter-professional Krippendorff's alpha ($\alpha$) agreement and GPT-4 to professional average alignment via Pearson, Kendall-Tau, and Spearman coefficients ($\sigma,\tau, \rho$). The error bars represent a 95% confidence interval.
Figure 5: Ablation Study on MIMIC-III (test set) Instructions
...and 1 more figure
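
The Figure 3 caption describes two derived quantities: each model's score as a percentage of the highest score in its size category, and a 95% confidence interval shown as error bars. The sketch below illustrates one way such numbers could be computed. It is not the authors' code: the function names, model names, and scores are hypothetical, and the normal-approximation interval is an assumption, since the caption does not say how the interval was obtained.

```python
import math
import statistics


def relative_to_best(mean_scores):
    """Express each model's mean score as a percentage of the best mean in the group."""
    best = max(mean_scores.values())
    return {model: 100.0 * score / best for model, score in mean_scores.items()}


def mean_ci95(samples):
    """Return (mean, half_width) for a normal-approximation 95% interval (assumed method)."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * sem


# Hypothetical per-sample evaluation scores for two models in one size category.
scores_by_model = {
    "model_a": [4, 3, 4, 4],
    "model_b": [3, 3, 2, 4],
}

means = {model: statistics.fmean(s) for model, s in scores_by_model.items()}
ratios = relative_to_best(means)

for model, samples in scores_by_model.items():
    mean, half_width = mean_ci95(samples)
    print(f"{model}: {mean:.2f} +/- {half_width:.2f} ({ratios[model]:.0f}% of best)")
```

Grouping models by size before calling `relative_to_best` keeps the comparison within a category, as the caption specifies.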
