A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients

Joakim Edin; Sedrah Butt Balaganeshan; Annike Kjølby Kristensen; Lars Maaløe; Ioannis Louloudis; Søren Brunak

TL;DR

A language model was trained on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark to predict ICD-10 codes from clinical notes, medications, and laboratory results. Its disagreements with human coders suggest that secondary diagnoses were under-coded in Eastern Denmark during the study period (2006–2016), with potential implications for epidemiological research, public health surveillance, and understanding of multimorbidity.

Abstract

Medical coding translates clinical documentation into standardized codes for billing, research, and public health, but manual coding is time-consuming and error-prone. Existing automation efforts rely on small datasets that poorly represent real-world patient heterogeneity. We trained a language model on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark (2006–2016) to predict ICD-10 codes from clinical notes, medications, and laboratory results. Evaluated on 270,000 held-out patients, the model achieved a micro F1 of 71.8% and a top-10 recall of 95.5%. Performance varied by specialty (F1: 53–91%), with higher scores in specialties with well-defined diagnostic criteria. Codes appearing predominantly as secondary diagnoses had markedly lower F1 scores. For three such codes (suicide-related behaviors, weight disorders, and hypertension), the model identified thousands of uncoded cases, of which 76–86% were confirmed valid upon manual review, suggesting systematic under-coding rather than model error. These findings suggest under-coding of secondary diagnoses in Eastern Denmark during this period, with potential implications for epidemiological research, public health surveillance, and understanding of multimorbidity. Similar time constraints and reimbursement structures in other healthcare systems suggest this may not be isolated to this dataset. The model can automate coding for approximately 50% of cases and provide accurate suggestions for most others, and may offer a practical solution to help capture missed secondary conditions.
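
For readers unfamiliar with the headline metric, the sketch below shows how a micro-averaged F1 score is conventionally computed for multi-label code prediction: true positives, false positives, and false negatives are pooled over all records before precision and recall are combined. The function name `micro_f1` and the toy ICD-10 code sets are illustrative assumptions, not the authors' evaluation code or data.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over a collection of records.

    gold, pred: sequences of iterables of code strings, one per record.
    Counts are pooled across all records before computing F1.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # codes both assigned by humans and predicted
        fp += len(p - g)   # predicted codes that humans did not assign
        fn += len(g - p)   # human-assigned codes the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (hypothetical records, not study data).
gold = [{"I10", "E66"}, {"X60"}]
pred = [{"I10"}, {"X60", "E11"}]
score = micro_f1(gold, pred)
```

Because counts are pooled, frequent codes dominate a micro-averaged score, which is one reason per-code and per-specialty breakdowns are reported alongside it.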

Paper Structure (26 sections, 5 figures, 1 table)

Introduction
Results
Large-scale data improves automated medical coding performance
Model-human agreement varies across medical specialties
ICD-10 codes with frequent human-model disagreements
What causes frequent human-model disagreements on secondary diagnoses?
Suicide-related behaviors
Weight disorders
Hypertension
Documentation quality affects model performance
Imprecise coding due to poor documentation visibility
Missing documentation with correct coding
Discussion
What causes the under-coding of secondary diagnoses?
Implications of poor coding quality for automated coding systems
...and 11 more sections

Figures (5)

Figure 1: The model's F1 score for each specialty. Each dot is the model's median F1 score for a specific department. The figure reveals a pattern in which the top specialties often involve planned admissions, while the bottom specialties treat patients who, on average, have many comorbidities. Only departments with more than one hundred examples in the test set are included.
Figure 2: a) The F1 score (y-axis) versus the number of occurrences in the training data (x-axis) for each ICD-10 code (level-3). b) The frequency of codes for secondary diagnoses in the training data (x-axis) and their F1 score (y-axis).
Figure 3: a) Recall@5 for primary diagnoses and secondary diagnoses, and b) Recall@10. Recall@5 and Recall@10 measure the frequency of human-annotated codes among the model's top 5 and top 10 predictions.
Figure 4: The data preprocessing pipeline for the study.
Figure 5: The interface we used for our manual explainability analysis. The clinical note is shown to the left, and the model's predictions are to the right. When hovering the mouse over a code prediction, the AttInGrad explanations are highlighted in the text of the clinical note. By visualizing the factors that caused the model prediction, we could quickly identify the elements that led to the human-model disagreements. The text in this example is AI-generated and does not relate to the data used in the study.
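
The Recall@k metric described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under one stated assumption: recall is pooled over all human-annotated codes in the evaluation set, which is only one of several reasonable readings of the caption. The function name `recall_at_k` and the toy rankings are hypothetical and do not come from the study.

```python
def recall_at_k(gold, ranked, k):
    """Fraction of human-annotated codes found in the model's top-k predictions.

    gold:   sequence of iterables of human-annotated codes, one per record.
    ranked: sequence of lists of predicted codes, ordered from most to
            least probable, one per record.
    Assumption: hits and totals are pooled across records.
    """
    hits = total = 0
    for g, r in zip(gold, ranked):
        g = set(g)
        top_k = set(r[:k])          # the k highest-ranked predictions
        hits += len(g & top_k)
        total += len(g)
    return hits / total if total else 0.0


# Toy example (hypothetical records, not study data).
gold = [{"I10", "E66"}, {"X60"}]
ranked = [["I10", "E11", "E66"], ["F32", "X60"]]
r1 = recall_at_k(gold, ranked, 1)
r2 = recall_at_k(gold, ranked, 2)
```

Under this definition, Recall@k can only stay the same or increase as k grows, which is why Recall@10 is at least as high as Recall@5.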
