Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling

Samuel Belkadi, Libo Ren, Nicolo Micheletti, Lifeng Han, Goran Nenadic

TL;DR

The paper tackles privacy barriers in medical data by proposing a Masked Language Modeling (MLM) framework to generate synthetic free-text medical records. It introduces a two-stage Masker–Mask-Filling pipeline that de-identifies PHI and preserves key medical information via NER, using an encoder-only MLM (Bio_ClinicalBERT) to fill masked spans, achieving strong privacy and low inference cost with around $120\text{M}$ parameters. Empirical results show a HIPAA-PHI recall of $0.96$, re-identification risk of $0.035$, and downstream NER performance on synthetic data comparable to real data, with data augmentation further improving F1 to $0.836$ (near the real-data baseline $0.842$). The approach offers a controllable diversity–fidelity trade-off, robust privacy protection, and practical viability for medical NLP tasks, enabling safer data sharing and model development in healthcare contexts.

Abstract

The vast amount of available medical records has the potential to improve healthcare and biomedical research. However, privacy restrictions make these data accessible for internal use only. Recent works have addressed this problem by generating synthetic data using Causal Language Modeling. Unfortunately, by taking this approach, it is often impossible to guarantee patient privacy while offering the ability to control the diversity of generations without increasing the cost of generating such data. In contrast, we present a system for generating synthetic free-text medical records using Masked Language Modeling. The system preserves critical medical information while introducing diversity in the generations and minimising re-identification risk. The system's size is about 120M parameters, minimising inference cost. The results demonstrate high-quality synthetic data with a HIPAA-compliant PHI recall rate of 96% and a re-identification risk of 3.5%. Moreover, downstream evaluations show that the generated data can effectively train a model with performance comparable to real data.
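The two-stage idea summarised above (mask identifiers, keep clinical entities, then let a masked language model fill the gaps) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the label names `PHI`, `MED`, and `O`, the functions `mask_tokens`, `fill_masks`, and `toy_predict`, and the `mask_ratio` parameter as the diversity control. In the described system an encoder-only MLM would supply the predictions; here a stub stands in so the example runs without any model.

```python
import random
from typing import Callable, List, Tuple

MASK = "[MASK]"

# Hypothetical label scheme: "PHI" = identifier, "MED" = clinical entity, "O" = other.
Tagged = List[Tuple[str, str]]


def mask_tokens(tagged: Tagged, mask_ratio: float, seed: int = 0) -> List[str]:
    """Stage 1 (sketch): always mask PHI, always keep clinical entities,
    and mask remaining tokens with probability `mask_ratio`."""
    rng = random.Random(seed)
    out = []
    for token, label in tagged:
        if label == "PHI":
            out.append(MASK)       # identifiers are never passed through
        elif label == "MED":
            out.append(token)      # clinical content is preserved verbatim
        elif rng.random() < mask_ratio:
            out.append(MASK)       # higher ratio -> more rewritten text
        else:
            out.append(token)
    return out


def fill_masks(tokens: List[str],
               predict: Callable[[List[str], int], str]) -> List[str]:
    """Stage 2 (sketch): replace each mask, left to right, with the token
    returned by `predict(context, position)`."""
    filled = list(tokens)
    for i, tok in enumerate(filled):
        if tok == MASK:
            filled[i] = predict(filled, i)
    return filled


def toy_predict(context: List[str], position: int) -> str:
    """Stand-in for a masked language model; returns a fixed placeholder."""
    return "<filled>"


# Hand-labelled toy input; a real system would obtain labels from NER models.
record = [("Mr", "O"), ("Smith", "PHI"), ("reports", "O"),
          ("chest", "MED"), ("pain", "MED")]
masked = mask_tokens(record, mask_ratio=0.5)
synthetic = fill_masks(masked, toy_predict)
```

In this sketch, raising `mask_ratio` rewrites more of the non-clinical text (more diversity), while lowering it keeps the output closer to the source (more fidelity); identifiers are masked and clinical entities kept at every setting.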

Paper Structure

This paper contains 23 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Design of the entire system, showcasing the Masker and Mask-Filling components.
  • Figure 2: Synthetic letters generated from letter 201-03 using System_I_0.7 (top) and System_S_0.5 (bottom).