
How many samples to label for an application given a foundation model? Chest X-ray classification study

Nikolay Nechaev, Evgeniia Przhezdzetskaia, Viktor Gombolevskiy, Dmitry Umerenkov, Dmitry Dylov

TL;DR

Problem: determine how many labeled chest X-ray samples are needed to reach clinical ROC-AUC targets when using foundation models. Method: fit a three-parameter power law, $\mathrm{ROC\text{-}AUC}(n) = \alpha - \beta / n^{\gamma}$, where $n$ is the number of labeled training examples, to learning curves built from small labeled subsets, comparing RadDINO-Maira2, XrayCLIP, XraySigLIP, and ResNet-50. Findings: foundation models achieve higher ROC-AUC with far fewer labels, and the early learning-curve slope reliably predicts the eventual performance plateau; plateau-prediction MAE decreases notably with as few as 50–100 labeled samples. Significance: the fitted curves let practitioners budget annotation effort and plan data-efficient deployment of chest X-ray classifiers.
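As a rough illustration of the fitting step, the sketch below fits the form $\alpha - \beta / n^{\gamma}$ to a learning curve using only the Python standard library. It is not the authors' implementation: the function name `fit_power_law`, the grid of candidate $\gamma$ values, and the synthetic learning-curve points are all assumptions made for this example. The idea is that, for a fixed $\gamma$, the model is linear in $\alpha$ and $\beta$, so those two can be solved in closed form and only $\gamma$ needs a search.

```python
def fit_power_law(ns, aucs, gammas=None):
    """Fit auc(n) = alpha - beta / n**gamma by grid search over gamma.

    For each candidate gamma the model is linear in (alpha, beta), so
    those two are obtained by ordinary least squares in closed form.
    Hypothetical helper for illustration only.
    """
    if gammas is None:
        # Assumed search grid: gamma = 0.05, 0.06, ..., 3.00
        gammas = [g / 100 for g in range(5, 301)]
    best = None
    for gamma in gammas:
        xs = [n ** -gamma for n in ns]  # regressor x = n^(-gamma)
        m = len(xs)
        mean_x = sum(xs) / m
        mean_y = sum(aucs) / m
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, aucs))
        slope = sxy / sxx               # slope of auc against x equals -beta
        alpha = mean_y - slope * mean_x
        beta = -slope
        # Sum of squared residuals for this gamma
        sse = sum((alpha - beta * x - y) ** 2 for x, y in zip(xs, aucs))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    _, alpha, beta, gamma = best
    return alpha, beta, gamma


# Synthetic, noise-free learning curve (made-up parameters, not paper data)
ns = [5, 10, 20, 50, 100, 200]
true_alpha, true_beta, true_gamma = 0.90, 0.50, 0.70
aucs = [true_alpha - true_beta / n ** true_gamma for n in ns]

alpha, beta, gamma = fit_power_law(ns, aucs)
print(f"alpha={alpha:.3f} beta={beta:.3f} gamma={gamma:.2f}")
```

On real, noisy learning curves a general nonlinear least-squares routine with parameter bounds would likely be a more robust choice than a fixed grid; the grid is used here only to keep the sketch dependency-free.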

Abstract

Chest X-ray classification is vital yet resource-intensive, typically demanding extensive annotated data for accurate diagnosis. Foundation models mitigate this reliance, but how many labeled samples are required remains unclear. We systematically evaluate the use of power-law fits to predict the training size necessary for specific ROC-AUC thresholds. Testing multiple pathologies and foundation models, we find XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than a ResNet-50 baseline. Importantly, learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus. Our results enable practitioners to minimize annotation costs by labeling only the essential samples for targeted performance.
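To make the sample-size prediction concrete: if a learning curve follows $\alpha - \beta / n^{\gamma}$, then solving $\alpha - \beta / n^{\gamma} \ge t$ for a target ROC-AUC $t < \alpha$ gives $n \ge (\beta / (\alpha - t))^{1/\gamma}$. The sketch below applies that inversion. The function name `samples_needed` and the parameter values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def samples_needed(target_auc, alpha, beta, gamma):
    """Smallest integer n with alpha - beta / n**gamma >= target_auc.

    Returns None when the target is at or above the plateau alpha,
    since the fitted curve never reaches it. Assumes beta > 0 and
    gamma > 0. Hypothetical helper for illustration only.
    """
    if target_auc >= alpha:
        return None
    n = (beta / (alpha - target_auc)) ** (1.0 / gamma)
    return max(1, math.ceil(n))


# Made-up fitted parameters, not values reported in the paper
alpha, beta, gamma = 0.90, 0.50, 0.70

reachable = samples_needed(0.85, alpha, beta, gamma)
unreachable = samples_needed(0.95, alpha, beta, gamma)
print(reachable, unreachable)
```

A `None` result is the useful case for budgeting: it signals that, under the fitted curve, no amount of additional labeling is predicted to reach the target with that model.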

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: A chest X-ray example, with example pathologies.
  • Figure 2: ROC-AUC vs the number of training examples for lobe mass pathology.
  • Figure 3: Correlation between the derivative of the fitted ROC-AUC curve at $n=5$ and the value of ROC-AUC at $N_{\max}$.
  • Figure 4: MAE between the experimental ROC-AUC and the ROC-AUC predicted from a limited number of training examples.