Synthetic4Health: Generating Annotated Synthetic Clinical Letters

Libo Ren; Samuel Belkadi; Lifeng Han; Warren Del-Pinto; Goran Nenadic

TL;DR

This paper provides a foundational framework for generating diverse, de-identified clinical letters and offers a direction for utilizing the model to process real-world clinical letters, thereby helping to expand datasets in the clinical domain.

Abstract

Because clinical letters contain sensitive information, clinical datasets cannot be widely used for model training, medical research, or teaching. This work aims to generate reliable, diverse, and de-identified synthetic clinical letters. To this end, we explored different pre-trained language models (PLMs) for masking and generating text. We then focused on Bio_ClinicalBERT, a high-performing model, and experimented with different masking strategies. Both qualitative and quantitative methods were used for evaluation. In addition, a downstream task, Named Entity Recognition (NER), was implemented to assess the usability of the synthetic letters. The results indicate that: 1) encoder-only models outperform encoder-decoder models; 2) among encoder-only models, those trained on general corpora perform comparably to those trained on clinical data when clinical information is preserved; 3) preserving clinical entities and document structure aligns better with our objectives than simply fine-tuning the model; 4) different masking strategies affect the quality of synthetic clinical letters: masking stopwords has a positive impact, while masking nouns or verbs has a negative one; 5) BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references; and 6) contextual information does not significantly impact the models' understanding, so the synthetic clinical letters have the potential to replace the original ones in downstream tasks.
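
As a rough illustration of the masking idea described in the abstract, the sketch below masks stopwords in a letter while leaving clinical entities untouched. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `mask_stopwords`, the tiny `STOPWORDS` list, the whitespace tokenization, and the example sentence are all hypothetical. In a full pipeline the `[MASK]` positions would then be filled by a masked language model such as Bio_ClinicalBERT, which is not shown here.

```python
import random

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "of", "to", "was", "with", "and", "on", "for", "in"}
MASK = "[MASK]"  # mask token used by BERT-style models


def mask_stopwords(tokens, preserved, ratio=0.5, seed=0):
    """Replace a fraction of stopwords with MASK, never touching preserved tokens.

    tokens:    list of word strings
    preserved: set of lowercased tokens to keep (e.g. clinical entities)
    ratio:     fraction of eligible stopwords to mask
    seed:      seed for the random choice of positions, for reproducibility
    """
    rng = random.Random(seed)
    # Positions that are stopwords and are not protected entities.
    eligible = [
        i for i, tok in enumerate(tokens)
        if tok.lower() in STOPWORDS and tok.lower() not in preserved
    ]
    n_mask = round(len(eligible) * ratio)
    chosen = set(rng.sample(eligible, n_mask))
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


# Made-up example sentence; the entity set stands in for an NER step.
letter = "The patient was admitted with pneumonia and started on amoxicillin".split()
entities = {"pneumonia", "amoxicillin"}

masked = mask_stopwords(letter, entities, ratio=1.0)
print(" ".join(masked))
# [MASK] patient [MASK] admitted [MASK] pneumonia [MASK] started [MASK] amoxicillin
```

Keeping the entity set separate from the masking rule means the same function could be reused with a different eligibility test (for example nouns or verbs) to compare strategies.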

Paper Structure (67 sections, 2 equations, 35 figures, 22 tables)

Introduction
Background and Literature Review
Development of Language Models (LMs)
Rule-Based Approach
Supervised Language Models
Unsupervised Language Models
Language Models Applications in Clinical Domain
Named Entity Recognition (NER)
De-Identification
Natural Language Generation (NLG)
Generative Language Models
Transformer and Attention Mechanism
Encoder-Only Models
Decoder-Only Models
Encoder-Decoder Models
...and 52 more sections

Figures (35)

Figure 1: An Example of the Objective: sentence/segment-level generations
Figure 2: An Example of LT3
Figure 3: An Input Example of Conditional Text Generation
Figure 4: Workflow of Discharge Summary Generation Using Clinical Guidelines
Figure 5: Workflow of MLM and CLM Comparison in Text Generation
...and 30 more figures
