Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography

Hanna Kreutzer, Anne-Sophie Caselitz, Thomas Dratsch, Daniel Pinto dos Santos, Christiane Kuhl, Daniel Truhn, Sven Nebelung

TL;DR

This study addresses data scarcity in AI for upper-extremity radiography by using GPT-4o to extract structured, uncertainty-aware labels from free-text radiology reports of the clavicle, elbow, and thumb. The authors train region-specific multi-label CNNs (modified ResNet-50) under two uncertainty-handling schemes (inclusive vs exclusive) and validate them on internal and external datasets, reporting label-level extraction accuracy of about 98.1–99.0% and macro-averaged AUC values of about $0.76$–$0.81$. No significant performance differences were found between inclusive and exclusive labeling ($p\geq0.15$), and uncertainty terms were only sparsely detected; external validation showed robust generalization for common findings but weaker performance for rare, soft-tissue–related labels. The work indicates that LLM-based, uncertainty-aware label extraction from routine radiology reports can rapidly generate high-quality training data for multi-label musculoskeletal radiography and supports scalable, decentralized AI model development, though limitations include language scope and limited evaluation against human-annotated baselines.

Abstract

Objectives: To evaluate GPT-4o's ability to extract diagnostic labels (with uncertainty) from free-text radiology reports and to test how these labels affect multi-label image classification of musculoskeletal radiographs. Methods: This retrospective study included radiography series of the clavicle (n=1,170), elbow (n=3,755), and thumb (n=1,978). After anonymization, GPT-4o filled out structured templates by indicating imaging findings as present ("true"), absent ("false"), or "uncertain." To assess the impact of label uncertainty, "uncertain" labels of the training and validation sets were automatically reassigned to "true" (inclusive) or "false" (exclusive). Label-image pairs were used for multi-label classification using ResNet50. Label extraction accuracy was manually verified on internal (clavicle: n=233, elbow: n=745, thumb: n=393) and external test sets (n=300 for each). Performance was assessed using macro-averaged receiver operating characteristic (ROC) area under the curve (AUC), precision-recall curves, sensitivity, specificity, and accuracy. AUCs were compared with the DeLong test. Results: Automatic extraction was correct in 98.6% (60,618 of 61,488) of labels in the test sets. Across anatomic regions, label-based model training yielded competitive performance measured by macro-averaged AUC values for inclusive (e.g., elbow: AUC=0.80 [range, 0.62–0.87]) and exclusive models (elbow: AUC=0.80 [range, 0.61–0.88]). Models generalized well on external datasets (elbow [inclusive]: AUC=0.79 [range, 0.61–0.87]; elbow [exclusive]: AUC=0.79 [range, 0.63–0.89]). No significant differences were observed across labeling strategies or datasets ($p\geq0.15$). Conclusion: GPT-4o extracted labels from radiologic reports to train competitive multi-label classification models with high accuracy. Detected uncertainty in the radiologic reports did not influence the performance of these models.
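The inclusive and exclusive relabeling step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `resolve_uncertain`, the finding names, and the example template are hypothetical and serve only to show how "uncertain" entries could be mapped to binary training targets under each scheme.

```python
from typing import Dict


def resolve_uncertain(labels: Dict[str, str], strategy: str) -> Dict[str, int]:
    """Map 'true'/'false'/'uncertain' label strings to binary targets.

    strategy='inclusive' treats 'uncertain' as present (1);
    strategy='exclusive' treats 'uncertain' as absent (0).
    """
    if strategy not in ("inclusive", "exclusive"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    uncertain_value = 1 if strategy == "inclusive" else 0
    mapping = {"true": 1, "false": 0, "uncertain": uncertain_value}
    return {finding: mapping[value] for finding, value in labels.items()}


# Hypothetical filled-out template for one report (finding names are assumptions).
report_labels = {
    "fracture": "true",
    "joint_effusion": "uncertain",
    "soft_tissue_calcification": "false",
}

inclusive_targets = resolve_uncertain(report_labels, "inclusive")
exclusive_targets = resolve_uncertain(report_labels, "exclusive")
```

Under this sketch the two target sets differ only at findings marked "uncertain", which is consistent with the abstract's observation that sparse uncertainty leaves the inclusive and exclusive models nearly identical.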

Paper Structure

This paper contains 11 sections, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Data Curation and Preparation. Left: internal dataset (University Hospital Aachen, 2010–2024); right: external dataset (University Hospital Cologne, 2010–2022). Identical exclusion criteria were applied to both sources: patients $<$18 years, post-operative imaging, follow-up examinations, and studies after amputation. Pediatric cases had already been removed by the Cologne site (note beneath the age-exclusion box). After exclusions, the Cologne pool underwent region-stratified random sampling to 300 studies each of the clavicle (CL), elbow (EL), and thumb (TH). The internal data were split into training (64%), validation (16%), and internal test (20%) subsets. The external data served only for final testing. All final datasets comprise anteroposterior projections of the clavicle and both anteroposterior and lateral projections of the elbow and thumb.
  • Figure 2: Study Workflow. A) Radiography series of the clavicle, elbow, and thumb and corresponding radiologic reports were collected and curated. B) The LLM filled out a region-specific structured template containing relevant conditions based on the radiologic reports. Individual labels were either “true,” “false,” or “uncertain.” The template was machine-readable and available in the JavaScript Object Notation (JSON) format. For subsequent model training, “uncertain” labels of the training/validation sets were automatically converted into “true” (inclusive labeling) or “false” (exclusive labeling) using Python. C) The JSON files were paired with the radiography series to train the image classification models as inclusive and exclusive versions using the respective labels. D) Both models were tested on internal and external test datasets containing manually corrected labels (ground truth).
  • Figure 3: Representative Radiography Series Illustrating Model Performance Across Different Labels and Anatomic Regions. Shown are true positives (TP; left) and false negatives (FN; right) for the label indicated underneath each radiography series consisting of anteroposterior (left) and lateral projections (right). For the clavicle (CL), the model correctly identified a middle-third fracture in one patient (TP) and misclassified it as a medial-third fracture in another patient (FN). For the elbow (EL), a soft tissue calcification at the triceps tendon insertion was accurately detected in one patient (TP) and missed in another (FN), likely because of its fainter and more subtle appearance. For the thumb (TH), metacarpophalangeal joint degeneration was correctly identified in one patient (TP) and missed in another patient (FN), likely because of a fixed Boutonnière-like deformity and the resulting superimposition of the metacarpus. Block arrows indicate calcifications, while arrows indicate signs of degeneration.
  • Figure S1: Receiver Operating Characteristic and Precision-Recall Curves of the Inclusive and Exclusive Models for the Clavicle for All Labels with n$\geq$10 in Both Datasets.
  • Figure S2: Receiver Operating Characteristic and Precision-Recall Curves of the Inclusive and Exclusive Models for the Elbow for All Labels with n$\geq$10 in Both Datasets.
  • ...and 1 more figure
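
The 64%/16%/20% partition of the internal data mentioned in the Figure 1 caption can be sketched as follows. This is only an illustration under stated assumptions: the source does not say whether the split was made per study or per patient, nor which tooling or random seed was used, so the function `split_studies`, the seed, and the placeholder study IDs are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_studies(
    study_ids: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle study IDs and split them into train/validation/test subsets.

    Assumed proportions: 20% test, 16% validation, remainder (about 64%) training.
    """
    ids = list(study_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_test = round(0.20 * len(ids))
    n_val = round(0.16 * len(ids))
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


# Placeholder IDs; a real pipeline would use anonymized study identifiers.
example_ids = [f"study_{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = split_studies(example_ids)
```

A per-patient split would group all studies of one patient into the same subset before shuffling, which can shift the subset sizes slightly away from the nominal percentages.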