SynGP500: A Clinically-Grounded Synthetic Dataset of Australian General Practice Medical Notes
Piyawoot Songsiritat
TL;DR
This work addresses the lack of Australian general practice clinical text by introducing SynGP500, a synthetic, privacy-preserving corpus of 500 notes. It combines curriculum-based breadth (RACGP 2022), BEACH-based prevalence calibration, SNOMED CT-AU mapping, and deployment diversity to produce realistic, messy notes. The dataset is validated through stylometric and semantic diversity analyses, epidemiological alignment, and a medical concept extraction evaluation that shows meaningful learning signals. The resource aims to enable early-stage NLP development, education, and benchmarking for Australian primary care while avoiding patient privacy concerns and the need for ethics approval. Its framework is scalable to long-tail conditions and adaptable to other domains, potentially reducing domain shift in cross-jurisdictional clinical NLP research.
Abstract
We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less common, curriculum-specified conditions that GPs must recognize but that appear infrequently in any single practice population, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. Unlike sanitized synthetic datasets that obscure clinical realities, SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements. We assess dataset quality through multi-faceted validation: epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and an exploratory downstream evaluation using self-supervised medical concept extraction, which shows F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.
