Synthetic4Health: Generating Annotated Synthetic Clinical Letters

Libo Ren, Samuel Belkadi, Lifeng Han, Warren Del-Pinto, Goran Nenadic

TL;DR

This paper provides a foundational framework for generating diverse, de-identified clinical letters and offers a direction for utilizing the model to process real-world clinical letters, thereby helping to expand datasets in the clinical domain.

Abstract

Because clinical letters contain sensitive information, clinical datasets cannot be widely used in model training, medical research, and teaching. This work aims to generate reliable, diverse, and de-identified synthetic clinical letters. To achieve this goal, we explored different pre-trained language models (PLMs) for masking and generating text. We then focused on Bio_ClinicalBERT, a high-performing model, and experimented with different masking strategies. Both qualitative and quantitative methods were used for evaluation. In addition, a downstream task, Named Entity Recognition (NER), was implemented to assess the usability of these synthetic letters. The results indicate that 1) encoder-only models outperform encoder-decoder models; 2) among encoder-only models, those trained on general corpora perform comparably to those trained on clinical data when clinical information is preserved; 3) preserving clinical entities and document structure aligns better with our objectives than simply fine-tuning the model; 4) masking strategies affect the quality of synthetic clinical letters: masking stopwords has a positive impact, while masking nouns or verbs has a negative effect; 5) BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references; and 6) contextual information does not significantly impact the models' understanding, so the synthetic clinical letters have the potential to replace the original ones in downstream tasks.
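
The masking idea summarized in the abstract (mask stopwords, preserve clinical entities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_letter`, the tiny stopword list, the `protected` entity set, and the `[MASK]` placeholder are all assumptions made for illustration. In a full pipeline, a masked language model such as Bio_ClinicalBERT would then fill the masked positions; that step is not shown here.

```python
import random

# Hypothetical, tiny stopword list used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "was", "is", "and", "with", "for", "on", "in"}


def mask_letter(tokens, protected, mask_ratio=0.5, mask_token="[MASK]", seed=0):
    """Mask a fraction of stopword tokens, never touching protected tokens.

    tokens:     list of word strings from a letter
    protected:  lowercase set of words to preserve (assumed clinical entities)
    mask_ratio: fraction of eligible stopwords to mask (assumed parameter)
    """
    rng = random.Random(seed)
    # Eligible positions: stopwords that are not in the protected set.
    candidates = [
        i for i, tok in enumerate(tokens)
        if tok.lower() in STOPWORDS and tok.lower() not in protected
    ]
    k = round(len(candidates) * mask_ratio)
    chosen = set(rng.sample(candidates, k))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


tokens = "The patient was admitted with pneumonia and started on amoxicillin".split()
protected = {"patient", "pneumonia", "amoxicillin"}
masked = mask_letter(tokens, protected, mask_ratio=1.0)
print(" ".join(masked))
```

With `mask_ratio=1.0` every eligible stopword is masked while the protected words stay in place; lower ratios mask a random subset, which is one simple way to vary how much of the original wording is regenerated.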

Paper Structure

This paper contains 67 sections, 2 equations, 35 figures, and 22 tables.

Figures (35)

  • Figure 1: An Example of the Objective: sentence/segment-level generations
  • Figure 2: An Example of LT3
  • Figure 3: An Input Example of Conditional Text Generation
  • Figure 4: Workflow of Discharge Summary Generation Using Clinical Guidelines
  • Figure 5: Workflow of MLM and CLM Comparison in Text Generation
  • ...and 30 more figures