Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics
Swarnava Bhattacharyya, Umapada Pal, Tapabrata Chakraborti
TL;DR
This work addresses the challenge of deploying powerful foundation models for skin lesion classification by incorporating predictive uncertainty quantification and demographic fairness. It combines conformal prediction with a dynamic F1-weighted sampler and leverages a DermFoundation backbone to produce per-sample uncertainty sets while guaranteeing population-level coverage. Validated on ISIC2019 and ASAN, the approach yields improvements for minority classes without sacrificing overall accuracy and reveals robust uncertainty signals across sex, age, and ethnicity. The framework is model- and task-agnostic, enabling safer clinical translation and progress toward personalized dermatology through per-patient conformal sets and transparent decision support.
Abstract
Deep learning-based diagnostic AI systems for medical images are starting to perform comparably to human experts. However, these data-hungry, complex systems are inherently black boxes and are therefore slow to be adopted in high-risk applications such as healthcare. This lack of transparency is exacerbated in recent large foundation models, which are trained in a self-supervised manner on millions of data points to provide robust generalisation across a range of downstream tasks; the process that generates their embeddings is not interpretable, so the embeddings are not easily trusted in clinical applications. To address this timely issue, we deploy conformal analysis to quantify the predictive uncertainty of a vision transformer (ViT) based foundation model across patient demographics (sex, age and ethnicity) for the task of skin lesion classification, using several public benchmark datasets. The significant advantage of conformal analysis is that it is method-independent: it provides not only a coverage guarantee at the population level but also an uncertainty score for each individual. We use a model-agnostic, dynamic F1-score-based sampling scheme during model training, which helps to mitigate class imbalance, and we investigate its effect on uncertainty quantification (UQ) by comparing results with and without this bias mitigation step. We thus show how this uncertainty quantification can be used as a fairness metric to evaluate the robustness of the feature embeddings of the foundation model (Google DermFoundation), and thereby advance the trustworthiness and fairness of clinical AI.
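To make the idea of conformal analysis across demographic groups concrete, the following is a minimal sketch of split conformal prediction with a per-group summary. It is an illustration under stated assumptions, not the authors' implementation: it assumes the classifier outputs softmax probabilities, uses the common nonconformity score of one minus the true-class probability, and all function names (`conformal_threshold`, `prediction_sets`, `groupwise_report`) are hypothetical. The abstract does not specify the score function or calibration procedure actually used.

```python
import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out data (split conformal).

    Assumed score: 1 - softmax probability of the true class.
    Returns np.inf when the calibration set is too small for the
    requested level, in which case every class is included.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        return np.inf
    return np.sort(scores)[k - 1]


def prediction_sets(test_probs, qhat):
    """Boolean mask of shape (n_test, n_classes); True means the class is in the set."""
    return (1.0 - test_probs) <= qhat


def groupwise_report(sets, labels, groups):
    """Empirical coverage and mean set size for each demographic group."""
    report = {}
    for g in np.unique(groups):
        group_sets = sets[groups == g]
        group_labels = labels[groups == g]
        covered = group_sets[np.arange(len(group_labels)), group_labels]
        report[str(g)] = {
            "coverage": float(covered.mean()),
            "mean_set_size": float(group_sets.sum(axis=1).mean()),
        }
    return report
```

In this sketch, the size of a patient's prediction set acts as the per-individual uncertainty signal, while the coverage averaged over all test patients is what the conformal guarantee targets. Comparing coverage and mean set size between groups (for example by sex or age band) is one way such a report could be read as a fairness check; large gaps between groups would suggest that the embeddings are less reliable for some subpopulations.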
