Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography
Hanna Kreutzer, Anne-Sophie Caselitz, Thomas Dratsch, Daniel Pinto dos Santos, Christiane Kuhl, Daniel Truhn, Sven Nebelung
TL;DR
This study addresses data scarcity in AI for upper-extremity radiography by using GPT-4o to extract structured, uncertainty-aware labels from free-text radiology reports for the clavicle, elbow, and thumb. The authors train region-specific multi-label CNNs (modified ResNet-50) under two uncertainty-handling schemes (inclusive vs. exclusive) and validate them on internal and external datasets, achieving high label-level accuracy (~98.1–99.0%) and macro-averaged AUC values of around $0.76$–$0.81$. No significant performance differences were found between inclusive and exclusive labeling ($p \geq 0.15$), with uncertainty terms only sparsely detected in the reports; external validation showed robust generalization for common findings but weaker performance for rare, soft-tissue–related labels. The work demonstrates that LLM-based, uncertainty-aware label extraction from routine radiology reports can rapidly generate high-quality training data for multi-label musculoskeletal radiography and supports scalable, decentralized AI model development, though limitations include language scope and limited evaluation against human-annotated baselines.
Abstract
Objectives: To evaluate GPT-4o's ability to extract diagnostic labels (with uncertainty) from free-text radiology reports and to test how these labels affect multi-label image classification of musculoskeletal radiographs. Methods: This retrospective study included radiography series of the clavicle (n=1,170), elbow (n=3,755), and thumb (n=1,978). After anonymization, GPT-4o filled out structured templates by indicating imaging findings as present ("true"), absent ("false"), or "uncertain." To assess the impact of label uncertainty, "uncertain" labels of the training and validation sets were automatically reassigned to "true" (inclusive) or "false" (exclusive). Label-image pairs were used for multi-label classification using ResNet-50. Label extraction accuracy was manually verified on internal (clavicle: n=233, elbow: n=745, thumb: n=393) and external test sets (n=300 for each). Performance was assessed using macro-averaged receiver operating characteristic (ROC) area under the curve (AUC), precision-recall curves, sensitivity, specificity, and accuracy. AUCs were compared with the DeLong test. Results: Automatic extraction was correct in 98.6% (60,618 of 61,488) of labels in the test sets. Across anatomic regions, label-based model training yielded competitive performance measured by macro-averaged AUC values for inclusive (e.g., elbow: AUC=0.80 [range, 0.62-0.87]) and exclusive models (elbow: AUC=0.80 [range, 0.61-0.88]). Models generalized well to external datasets (elbow [inclusive]: AUC=0.79 [range, 0.61-0.87]; elbow [exclusive]: AUC=0.79 [range, 0.63-0.89]). No significant differences were observed across labeling strategies or datasets (p≥0.15). Conclusion: GPT-4o extracted labels from radiologic reports with high accuracy, and these labels were sufficient to train competitive multi-label classification models. Detected uncertainty in the radiologic reports did not influence the performance of these models.
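As an illustration of the two uncertainty-handling schemes and of macro-averaging, the following is a minimal sketch, not the authors' implementation. The function names (`resolve_uncertain`, `binary_auc`, `macro_auc`), the finding names, and the toy scores are all hypothetical; only the three label states and the inclusive/exclusive reassignment rule come from the abstract. The AUC is computed here with the pairwise (Mann-Whitney) formulation, which is one standard way to obtain the ROC AUC and may differ from the tooling used in the study.

```python
from typing import Dict, List

# Three-state labels as described in the abstract: "true", "false", "uncertain".
# The finding names used in the example below are hypothetical.
Report = Dict[str, str]


def resolve_uncertain(labels: Report, scheme: str) -> Dict[str, int]:
    """Map three-state labels to binary training targets.

    scheme="inclusive": "uncertain" is reassigned to 1 (present).
    scheme="exclusive": "uncertain" is reassigned to 0 (absent).
    """
    if scheme not in ("inclusive", "exclusive"):
        raise ValueError(f"unknown scheme: {scheme!r}")
    uncertain_value = 1 if scheme == "inclusive" else 0
    mapping = {"true": 1, "false": 0, "uncertain": uncertain_value}
    return {finding: mapping[state] for finding, state in labels.items()}


def binary_auc(y_true: List[int], scores: List[float]) -> float:
    """ROC AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true_by_label: Dict[str, List[int]],
              scores_by_label: Dict[str, List[float]]) -> float:
    """Unweighted mean of the per-label AUCs (macro-averaging)."""
    aucs = [binary_auc(y_true_by_label[k], scores_by_label[k])
            for k in y_true_by_label]
    return sum(aucs) / len(aucs)


# Hypothetical report with one uncertain finding.
report = {"fracture": "true", "joint_effusion": "uncertain", "osteoarthritis": "false"}
inclusive_targets = resolve_uncertain(report, "inclusive")
exclusive_targets = resolve_uncertain(report, "exclusive")

# Toy per-label ground truth and model scores (made up for illustration).
toy_truth = {"fracture": [1, 0], "joint_effusion": [1, 0, 1, 0]}
toy_scores = {"fracture": [0.8, 0.2], "joint_effusion": [0.9, 0.1, 0.4, 0.6]}
toy_macro_auc = macro_auc(toy_truth, toy_scores)
```

In this sketch the two schemes differ only in the target assigned to `joint_effusion`, so when few labels are "uncertain" the inclusive and exclusive training sets are nearly identical. Because macro-averaging weights every label equally, a rare finding with a low AUC lowers the average as much as a common one would.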
