Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data

John Cook, Michael Wyatt, Peng Wei, Iris Chin, Santosh Gupta, Van Zyl Van Vuuren, Richie Siburian, Amanda Spicer, Kristen Viviano, Alda Cami, Raunaq Malhotra, Zhewei Yao, Jeff Rasley, Gaurav Kaushik

Abstract

Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tailed code distributions. Large language models have been proposed to assist with or automate specific medical coding tasks, but foundation models are not explicitly trained for medical coding, and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama-3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. A zero-shot baseline with the unadapted model achieved an F1 score of 0.18 for exact code match. After fine-tuning on the synthetic corpus, exact-match F1 exceeded 0.70, a large absolute gain across both code systems. Notably, performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including the Advanced Illness and Frailty classes, and the model retained its performance on medical comprehension tasks. These results indicate that synthetic, policy-aware data can efficiently teach a general-purpose large language model to support precise medical coding without exposing protected health information. The approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations.
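
To make the exact-match metric concrete, the following is a minimal sketch of how an F1 score over predicted and gold code sets could be computed. It assumes a code counts as correct only when its string matches a gold code exactly, and it shows both a per-chart score and a micro-averaged score; the abstract does not state which averaging was used, so the function names (`set_f1`, `micro_f1`), the handling of empty sets, and the example codes are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import Iterable, Set, Tuple


def set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 for one chart; a predicted code counts only on an exact string match."""
    if not predicted and not gold:
        # Assumption: predicting nothing for a chart with no gold codes is perfect.
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def micro_f1(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over charts."""
    tp = fp = fn = 0
    for predicted, gold in pairs:
        tp += len(predicted & gold)   # codes predicted and present in gold
        fp += len(predicted - gold)   # codes predicted but not in gold
        fn += len(gold - predicted)   # gold codes the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


# Illustrative charts (hypothetical predictions, not data from the paper).
charts = [
    ({"E11.9", "I10"}, {"E11.9", "I50.9"}),  # one of two ICD-10-CM codes correct
    ({"99213"}, {"99213"}),                  # CPT code matched exactly
]
per_chart = [set_f1(p, g) for p, g in charts]
overall = micro_f1(charts)
```

Under exact matching, a prediction such as `I50.9` against a gold code in the same category but with a different suffix earns no credit, which is why scores at this level are stricter than category-level agreement.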

Paper Structure (28 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Exact-match ICD-10-CM and CPT coding. F1 scores for the fine-tuned Llama-3-70B model evaluated on held-out synthetic (A) ICD-10-CM and (B) CPT datasets. Results reflect baseline inference without prompt engineering or semantic retrieval assistance.
  • Figure 2: ICD-10-CM coding performance across hierarchical levels. ICD-10-CM performance is reported at increasing levels of diagnostic specificity, from coarse category identification to exact ICD-10 code (Level 3) and exact code with supporting evidence attribution (Level 4). Performance is highest at the coarsest level and declines gradually as diagnostic specificity increases. CPT coding performance, evaluated independently using exact set matching, achieves an F1 score of 0.736. Results reflect baseline inference without prompt engineering or semantic retrieval assistance.
  • Figure 3: Breakdown of ICD-10-CM coding performance across clinical domains and diagnostic groupings. (a) Weighted mean F1 score by clinical domain for Advanced Illness, Frailty, and Social Determinants of Health (SDoH). Error bars represent the weighted standard deviations. (b) Lowest-performing ICD-10-CM category-level codes with at least 10 evaluation cases. (c) Mean F1 score by ICD-10-CM chapter. Results are computed over 761 held-out synthetic clinical charts.
  • Figure 4: Training workflow including data augmentation and sequence packing. A. Synthetic notes and linked ICD-10-CM and CPT labels are bundled. The training dataset is augmented by concatenating multiple notes to increase contextual difficulty and formatted using structured prompts. This amounts to 30% of the original volume of notes. Data augmentation and packing are applied during training only. B. Samples are prepared in a format including system prompt, instruction, and output format. C. Samples are packed into fixed-length sequences to reduce padding and improve computational efficiency. D. The resulting sequences are used for supervised fine-tuning of the Llama-3-70B model.
  • Figure 5: Relationship between ICD-10-CM training frequency and evaluation performance. Scatter plot of category-level F1 score in the evaluation set versus the number of occurrences in the training data (categories with $\geq$10 evaluation cases). Low-frequency categories exhibit high variance in performance, while higher-frequency categories consistently achieve strong F1 scores, indicating a frequency threshold effect rather than a linear relationship.
  • ...and 1 more figure
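
The sequence packing step described in the Figure 4 caption (panel C) can be illustrated with a small sketch. The caption does not say which packing algorithm was used, so the code below assumes a greedy first-fit-decreasing strategy over token counts; `pack_samples`, `padding_fraction`, the sample lengths, and the 1024-token sequence length are all hypothetical choices for illustration.

```python
from typing import List


def pack_samples(sample_lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing; returns bins of sample indices."""
    # Visit the longest samples first so short ones can fill the leftover space.
    order = sorted(range(len(sample_lengths)),
                   key=lambda i: sample_lengths[i], reverse=True)
    bins: List[List[int]] = []   # sample indices assigned to each packed sequence
    free: List[int] = []         # remaining token budget of each packed sequence
    for i in order:
        n = sample_lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        for b, space in enumerate(free):
            if n <= space:       # first existing sequence with enough room
                bins[b].append(i)
                free[b] -= n
                break
        else:                    # no sequence had room: open a new one
            bins.append([i])
            free.append(max_len - n)
    return bins


def padding_fraction(sample_lengths: List[int], num_sequences: int, max_len: int) -> float:
    """Share of token positions that would be padding across all sequences."""
    return 1 - sum(sample_lengths) / (num_sequences * max_len)


# Hypothetical token counts for five formatted training samples.
lengths = [700, 300, 512, 512, 200]
packed = pack_samples(lengths, max_len=1024)
padded_alone = padding_fraction(lengths, len(lengths), 1024)  # one sample per sequence
padded_packed = padding_fraction(lengths, len(packed), 1024)  # after packing
```

In this toy case the five samples fit into three fixed-length sequences instead of five, so fewer positions are spent on padding. A real training pipeline would also need attention masking or position resets so that packed samples do not attend to one another, which this sketch omits.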