How many samples to label for an application given a foundation model? Chest X-ray classification study

Nikolay Nechaev; Evgeniia Przhezdzetskaia; Viktor Gombolevskiy; Dmitry Umerenkov; Dmitry Dylov

TL;DR

Problem: determine how many labeled chest X-ray samples are needed to reach clinical ROC-AUC targets when using foundation models. Method: fit a 3-parameter power law $\mathrm{ROC\text{-}AUC}(n) = \alpha - \beta / n^{\gamma}$ to learning curves built from small labeled subsets, comparing RadDINO-Maira2, XrayCLIP, XraySigLIP, and ResNet-50. Findings: foundation models achieve higher ROC-AUC with far fewer labels, and the early learning-curve slope reliably predicts the eventual performance plateau; plateau-prediction MAE decreases notably with as few as 50–100 labeled samples. Significance: enables budgeting annotation effort and guides deployment strategies for chest X-ray classifiers.
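To make the fitting step concrete, here is a minimal sketch of how a 3-parameter power law of this form could be fitted to a learning curve. It is not the authors' implementation: the function names (`power_law`, `fit_power_law`), the grid over $\gamma$, and the synthetic data points are all assumptions made for illustration. The sketch relies on the observation that, for a fixed $\gamma$, the model is linear in $x = n^{-\gamma}$, so $\alpha$ and $\beta$ follow from ordinary least squares and only $\gamma$ needs to be searched.

```python
def power_law(n, alpha, beta, gamma):
    """ROC-AUC(n) = alpha - beta / n**gamma."""
    return alpha - beta / n ** gamma


def fit_power_law(sizes, aucs, gammas=None):
    """Fit (alpha, beta, gamma) by a grid search over gamma.

    For a fixed gamma the model is linear in x = n**(-gamma):
    auc = alpha - beta * x, so alpha and beta have a closed-form
    least-squares solution. The gamma grid is an illustrative choice.
    """
    if gammas is None:
        gammas = [g / 100 for g in range(5, 301)]  # 0.05 .. 3.00
    m = len(sizes)
    y_mean = sum(aucs) / m
    best = None
    for gamma in gammas:
        xs = [n ** -gamma for n in sizes]
        x_mean = sum(xs) / m
        sxx = sum((x - x_mean) ** 2 for x in xs)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, aucs))
        slope = sxy / sxx              # slope of auc vs. x, equal to -beta
        alpha = y_mean - slope * x_mean
        beta = -slope
        # Sum of squared errors for this candidate gamma.
        sse = sum(
            (power_law(n, alpha, beta, gamma) - y) ** 2
            for n, y in zip(sizes, aucs)
        )
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    _, alpha, beta, gamma = best
    return alpha, beta, gamma


# Synthetic, noise-free learning curve (made-up numbers, not from the paper).
sizes = [10, 20, 50, 100, 200, 500]
aucs = [power_law(n, 0.90, 0.50, 0.50) for n in sizes]
alpha, beta, gamma = fit_power_law(sizes, aucs)
```

In this parameterization $\alpha$ plays the role of the performance plateau, since the $\beta / n^{\gamma}$ term vanishes as $n$ grows. With real, noisy learning curves a general nonlinear least-squares routine may be preferable to a fixed grid.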

Abstract

Chest X-ray classification is vital yet resource-intensive, typically demanding extensive annotated data for accurate diagnosis. Foundation models mitigate this reliance, but how many labeled samples are required remains unclear. We systematically evaluate the use of power-law fits to predict the training size necessary for specific ROC-AUC thresholds. Testing multiple pathologies and foundation models, we find XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than a ResNet-50 baseline. Importantly, learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus. Our results enable practitioners to minimize annotation costs by labeling only the essential samples for targeted performance.
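Once a power law $\mathrm{ROC\text{-}AUC}(n) = \alpha - \beta / n^{\gamma}$ has been fitted, predicting the training size for a target threshold $t$ amounts to inverting it: $n^{*} = \left(\beta / (\alpha - t)\right)^{1/\gamma}$, which is only defined when $t < \alpha$. The sketch below illustrates that inversion. The function name `samples_needed` and the parameter values are hypothetical and are not taken from the paper; it assumes $\beta > 0$ and $\gamma > 0$.

```python
import math


def samples_needed(target_auc, alpha, beta, gamma):
    """Smallest integer n with alpha - beta / n**gamma >= target_auc.

    Returns None when the target is at or above the fitted plateau
    alpha, since no finite number of labels reaches it under this model.
    Assumes beta > 0 and gamma > 0.
    """
    if target_auc >= alpha:
        return None
    gap = alpha - target_auc
    return max(1, math.ceil((beta / gap) ** (1.0 / gamma)))


# Hypothetical fitted parameters, chosen only for illustration.
n_needed = samples_needed(0.86, alpha=0.90, beta=0.50, gamma=0.50)
unreachable = samples_needed(0.95, alpha=0.90, beta=0.50, gamma=0.50)
```

A `None` result is a useful signal in its own right: under the fitted curve, labeling more data alone would not reach the target, so a different model or feature extractor would be needed.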
