How many samples to label for an application given a foundation model? Chest X-ray classification study
Nikolay Nechaev, Evgeniia Przhezdzetskaia, Viktor Gombolevskiy, Dmitry Umerenkov, Dmitry Dylov
TL;DR
Problem: determine how many labeled chest X-ray samples are needed to reach clinical ROC-AUC targets when using foundation models. Method: fit a three-parameter power law $\mathrm{ROC\text{-}AUC}(n) = \alpha - \beta / n^{\gamma}$, where $n$ is the number of labeled training samples and $\alpha$ is the plateau approached as $n$ grows, to learning curves built from small labeled subsets, comparing RadDINO-Maira2, XrayCLIP, XraySigLIP, and ResNet-50. Findings: foundation models reach higher ROC-AUC with far fewer labels, and the early learning-curve slope reliably predicts the eventual performance plateau; the plateau-prediction MAE drops notably with as few as 50–100 labeled samples. Significance: the fit lets practitioners budget annotation effort and plan deployment of chest X-ray classifiers.
Abstract
Chest X-ray classification is vital yet resource-intensive, typically demanding extensive annotated data for accurate diagnosis. Foundation models mitigate this reliance, but how many labeled samples are required remains unclear. We systematically evaluate the use of power-law fits to predict the training size necessary for specific ROC-AUC thresholds. Testing multiple pathologies and foundation models, we find XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than a ResNet-50 baseline. Importantly, learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus. Our results enable practitioners to minimize annotation costs by labeling only the essential samples for targeted performance.
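The idea of fitting a power law to a short learning curve and then inverting it for a target ROC-AUC can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_power_law`, `samples_needed`), the grid over the exponent, and the data points are all assumptions made for illustration, and the data are synthetic, noise-free values generated from the model itself. The fitting routine is a simple swapped-in technique: for a fixed exponent the model is linear in the other two parameters, so they come from ordinary linear regression, and the exponent is chosen by grid search.

```python
import math


def power_law(n, alpha, beta, gamma):
    """Model from the TL;DR: ROC-AUC(n) = alpha - beta / n**gamma."""
    return alpha - beta / n ** gamma


def fit_power_law(sizes, aucs, gammas=None):
    """Least-squares fit of (alpha, beta, gamma) by grid search over gamma.

    For a fixed gamma the model is linear in x = n**(-gamma):
    auc = alpha - beta * x, so alpha and beta follow from simple
    linear regression. The gamma grid is an illustrative assumption.
    """
    if gammas is None:
        gammas = [g / 100 for g in range(5, 301)]  # 0.05, 0.06, ..., 3.00
    m = len(sizes)
    y_mean = sum(aucs) / m
    best = None
    for gamma in gammas:
        xs = [n ** -gamma for n in sizes]
        x_mean = sum(xs) / m
        sxx = sum((x - x_mean) ** 2 for x in xs)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, aucs))
        slope = sxy / sxx            # slope of auc vs. x, equal to -beta
        alpha = y_mean - slope * x_mean
        beta = -slope
        sse = sum((power_law(n, alpha, beta, gamma) - y) ** 2
                  for n, y in zip(sizes, aucs))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    return best[1], best[2], best[3]


def samples_needed(target, alpha, beta, gamma):
    """Smallest n with predicted ROC-AUC >= target.

    Returns None when the target is at or above the fitted plateau alpha,
    or when the fit does not describe a rising curve (beta <= 0).
    """
    if target >= alpha or beta <= 0:
        return None
    return math.ceil((beta / (alpha - target)) ** (1 / gamma))


# Synthetic learning curve (made-up numbers, not results from the paper).
sizes = [10, 20, 50, 100, 200]
aucs = [power_law(n, 0.90, 0.5, 0.5) for n in sizes]

alpha, beta, gamma = fit_power_law(sizes, aucs)
print(f"plateau={alpha:.3f} beta={beta:.3f} gamma={gamma:.2f}")
print("labels for ROC-AUC 0.85:", samples_needed(0.85, alpha, beta, gamma))
print("labels for ROC-AUC 0.95:", samples_needed(0.95, alpha, beta, gamma))
```

Solving $\alpha - \beta / n^{\gamma} = t$ for $n$ gives $n = (\beta / (\alpha - t))^{1/\gamma}$, which is what `samples_needed` computes. A target at or above the fitted plateau has no finite solution, so the function reports it as unreachable rather than returning a number. With real, noisy ROC-AUC measurements the fitted parameters would carry uncertainty that this sketch does not estimate.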
