SUDO: a framework for evaluating clinical artificial intelligence systems without ground-truth annotations

Dani Kiyasseh, Aaron Cohen, Chengsheng Jiang, Nicholas Altieri

TL;DR

SUDO can identify unreliable predictions, inform the selection of models, and allow for the previously out-of-reach assessment of algorithmic bias for data in the wild without ground-truth annotations, which can contribute to the deployment of trustworthy and ethical AI systems in medicine.

Abstract

A clinical artificial intelligence (AI) system is often validated on a held-out set of data to which it has not been exposed before (e.g., data from a different hospital with a distinct electronic health record system). This evaluation process is meant to mimic the deployment of an AI system on data in the wild: data that the system has not yet seen but is expected to encounter in a clinical setting. However, when data in the wild differ from the held-out set of data, a phenomenon referred to as distribution shift, and lack ground-truth annotations, the extent to which AI-based findings can be trusted on data in the wild becomes unclear. Here, we introduce SUDO, a framework for evaluating AI systems without ground-truth annotations. SUDO assigns temporary labels to data points in the wild and directly uses them to train distinct models, with the highest performing model indicative of the most likely label. Through experiments with AI systems developed for dermatology images, histopathology patches, and clinical reports, we show that SUDO can be a reliable proxy for model performance and thus identify unreliable predictions. We also demonstrate that SUDO informs the selection of models and allows for the previously out-of-reach assessment of algorithmic bias for data in the wild without ground-truth annotations. The ability to triage unreliable predictions for further inspection and assess the algorithmic bias of AI systems can improve the integrity of research findings and contribute to the deployment of ethical AI systems in medicine.

Paper Structure (25 sections, 4 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: SUDO is a framework to evaluate AI systems without ground-truth labels. (a) An AI system is often deployed on data in the wild, which can vary significantly from those in the held-out set (distribution shift), and which can also lack ground-truth labels. (b) SUDO is a 5-step framework that circumvents the challenges posed by data in the wild. First, deploy an AI system on data in the wild to obtain probability values. Second, discretize those values into quantiles. Third, sample data points from each quantile and pseudo-label (temporarily label) them with a possible class (SUDO Class 0). Sample data points with ground-truth labels from the opposite class to form a classification task. Fourth, train a classifier to distinguish between these data points. Repeat the process with a different pseudo-label (SUDO Class 1). Finally, evaluate and compare the performance of the classifiers on the same held-out set of data with ground-truth labels, deriving the pseudo-label discrepancy. This discrepancy and the relative classifier performance indicate whether the sampled data points are more likely to belong to one class than another.
  • Figure 2: SUDO can be a reliable proxy for model performance on the Stanford diverse dermatology image dataset. Two models (left column: DeepDerm, right column: HAM10000) are pre-trained on the HAM10000 dataset and deployed on the entire Stanford DDI dataset. (a-b) Distribution of the prediction probability values produced by the two models colour-coded based on the ground-truth label (negative vs. positive) of the data points. (c-d) Correlation of SUDO with the proportion of positive data points in each probability interval. Results are shown for ten mutually-exclusive probability intervals that span the range $[0,1]$. A strong correlation indicates that SUDO can be used to identify unreliable predictions. (e) Reliability-completeness curves of the two models, where the area under the reliability-completeness curve (AURCC) can inform the selection of an AI system without ground-truth annotations.
  • Figure 3: SUDO can be a reliable proxy for model performance on the Camelyon17-WILDS histopathology dataset. (a) Distribution of the prediction probability values produced by a model colour-coded based on the ground-truth label (negative vs. positive) of the data points. (b) SUDO values colour-coded according to the most likely label of the predictions in each probability interval.
  • Figure 4: SUDO correlates with model performance and can identify unreliable predictions on the Flatiron Health ECOG Performance Status data without ground-truth annotations. Results are shown for the (left column) test set with ground-truth annotations and (right column) data in the wild without ground-truth annotations. (a-b) Distribution of the prediction probability values produced by an NLP model. (c-d) SUDO values colour-coded according to the most likely label of the predictions in each probability interval. (e-f) Survival curves for patient groups identified via (e) ground-truth annotations and (f) SUDO values: we identify reliable low ECOG PS predictions ($0<p<0.2$) and high ECOG PS predictions ($0.5<p<1.0$), and unreliable predictions ($0.2<p<0.5$). (g-h) Correlation between SUDO and proportion of positive instances in each probability interval (using ground-truth annotations) and the median survival time of patients in each probability interval (without ground-truth annotations).
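
The five steps described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier, the use of accuracy as the performance metric, the equal-width probability intervals, the absence of subsampling within each interval, and the sign convention of the discrepancy (performance under SUDO Class 1 minus performance under SUDO Class 0) are all assumptions made for illustration, as are the function names `fit_and_score` and `sudo_by_interval` and the synthetic two-dimensional data.

```python
import random
from statistics import fmean


def centroid(points):
    """Mean feature vector of a non-empty list of equal-length vectors."""
    return [fmean(column) for column in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_and_score(class0, class1, held_x, held_y):
    """Fit a nearest-centroid classifier (assumed stand-in for any
    classifier) and return its accuracy on the labelled held-out set."""
    c0, c1 = centroid(class0), centroid(class1)
    hits = [int((sq_dist(x, c1) < sq_dist(x, c0)) == bool(y))
            for x, y in zip(held_x, held_y)]
    return fmean(hits)


def sudo_by_interval(wild_x, wild_probs, lab_x, lab_y, held_x, held_y,
                     n_bins=10):
    """Return {interval index: pseudo-label discrepancy} for wild data."""
    known0 = [x for x, y in zip(lab_x, lab_y) if y == 0]
    known1 = [x for x, y in zip(lab_x, lab_y) if y == 1]
    # Step 2: group wild points by predicted probability
    # (equal-width intervals here, an assumption).
    bins = {}
    for x, p in zip(wild_x, wild_probs):
        bins.setdefault(min(int(p * n_bins), n_bins - 1), []).append(x)
    scores = {}
    for b, xs in sorted(bins.items()):
        # Steps 3-4, SUDO Class 0: wild points pseudo-labelled 0
        # against ground-truth class 1 points.
        perf0 = fit_and_score(xs, known1, held_x, held_y)
        # Steps 3-4, SUDO Class 1: wild points pseudo-labelled 1
        # against ground-truth class 0 points.
        perf1 = fit_and_score(known0, xs, held_x, held_y)
        # Step 5: assumed sign convention, > 0 suggests class 1.
        scores[b] = perf1 - perf0
    return scores


# Synthetic demonstration: class 0 is centred at (0, 0), class 1 at (4, 4).
rng = random.Random(0)


def draw(center, n):
    """Draw n two-dimensional Gaussian points around (center, center)."""
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
            for _ in range(n)]


lab_x = draw(0.0, 200) + draw(4.0, 200)
lab_y = [0] * 200 + [1] * 200
held_x = draw(0.0, 200) + draw(4.0, 200)
held_y = [0] * 200 + [1] * 200
# Step 1 is simulated: the wild points receive fixed probabilities
# instead of the outputs of a deployed model.
wild_x = draw(0.0, 200) + draw(4.0, 200)
wild_probs = [0.05] * 200 + [0.95] * 200

scores = sudo_by_interval(wild_x, wild_probs, lab_x, lab_y, held_x, held_y)
print(scores)
```

In this toy setting the wild points in the lowest interval truly come from class 0 and those in the highest interval from class 1, so the discrepancy should be negative for the former and positive for the latter. Under the same assumed convention, a discrepancy close to zero would mark an interval whose predictions deserve further inspection.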