Concordance in basal cell carcinoma diagnosis. Building a proper ground truth to train Artificial Intelligence tools
Francisca Silva-Clavería, Carmen Serrano, Iván Matas, Amalia Serrano, Tomás Toledo-Pastrana, Begoña Acha
TL;DR
The paper tackles the subjectivity of dermoscopic criteria for basal cell carcinoma (BCC) and argues that a ground truth (GT) derived from multiple dermatologists is essential for training AI that explains its diagnosis through the dermoscopic features it detects. It analyzes two datasets, 204 cases for GT consensus and 1230 cases for AI training, using interrater agreement metrics and two GT inference methods (majority voting and expectation maximization). The authors show that dermatologist diagnoses align well with biopsy, but agreement on specific dermoscopic patterns is variable; the inferred GTs smooth out outlier judgments, which changes what the AI learns. The findings underscore the importance of multi-rater GT construction for reliable, explainable AI in dermatology, and provide a framework for evaluating GTs and their effect on AI performance in BCC classification and pattern detection.
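As a rough illustration of the simpler of the two inference methods named above, the following sketch applies majority voting to binary pattern annotations from several raters. It is a minimal example under stated assumptions, not the authors' implementation: the function name `majority_vote`, the five-pattern layout, the toy annotations, and the tie-breaking rule (a 2-2 split counts as "absent" by default) are all hypothetical.

```python
from typing import List


def majority_vote(annotations: List[List[int]], tie_value: int = 0) -> List[int]:
    """Infer a consensus label vector from several raters.

    annotations[r][p] is 1 if rater r marked pattern p as present, else 0.
    tie_value is the label assigned when votes split evenly; this rule is
    an assumption of the sketch, since an even number of raters can tie.
    """
    n_raters = len(annotations)
    n_patterns = len(annotations[0])
    consensus = []
    for p in range(n_patterns):
        votes = sum(rater[p] for rater in annotations)
        if 2 * votes > n_raters:      # strict majority says "present"
            consensus.append(1)
        elif 2 * votes < n_raters:    # strict majority says "absent"
            consensus.append(0)
        else:                         # even split
            consensus.append(tie_value)
    return consensus


# Toy data: four raters, five hypothetical dermoscopic patterns for one lesion.
raters = [
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
]
consensus = majority_vote(raters)
print(consensus)
```

In this toy case the single rater who marked the second pattern is outvoted, which is the smoothing of outlier judgments described above. Expectation maximization would instead weight each rater by an estimated reliability rather than counting votes equally.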
Abstract
Background: The different clinical criteria for basal cell carcinoma (BCC) cannot be objectively validated. An adequate ground truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on the dermoscopic criteria of 204 BCC cases, and to analyze the performance of an AI tool when the ground truth is inferred. Methods: A single-center, prospective diagnostic study was conducted to analyze the agreement of four dermatologists on dermoscopic criteria and then derive a reference standard. A total of 1434 dermoscopic images were used; they were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). Of these, 204 were used to test an AI tool and the remainder to train it. The performance of the AI tool trained with the ground truth of one dermatologist was compared with its performance when trained with the ground truth statistically inferred from the consensus of four dermatologists, using McNemar's test and Hamming distance. Results: Dermatologists achieved almost perfect agreement in the diagnosis of BCC (Fleiss' kappa = 0.9079) and high agreement with biopsy (PPV = 0.9670). However, agreement was low in detecting some dermoscopic criteria. Statistically significant differences were found between the performance of the AI tool trained with the ground truth of one dermatologist and that of the tool trained with the ground truth inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. The ground truth should be established from multiple dermatologists.
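The three statistics named in the abstract can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the study's code: the function names and all toy numbers are hypothetical, and the exact (binomial) form of McNemar's test is an assumption, since the abstract does not say which variant was used.

```python
from math import comb
from typing import List, Sequence


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa. counts[i][j] is the number of raters who assigned
    case i to category j; every row must sum to the same number of raters."""
    n_cases = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement, averaged over cases.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_cases
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    props = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in props)
    return (p_bar - p_e) / (1 - p_e)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value. b and c are the discordant counts:
    cases only model A got right, and cases only model B got right."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length label vectors differ."""
    if len(a) != len(b):
        raise ValueError("label vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Toy data: five lesions, four raters, categories (BCC, not BCC).
toy_counts = [[4, 0], [4, 0], [0, 4], [0, 4], [3, 1]]
kappa = fleiss_kappa(toy_counts)

# Toy discordant counts between two hypothetical models.
p_value = mcnemar_exact(1, 9)

# Predicted versus reference pattern vectors for one hypothetical lesion.
distance = hamming_distance([1, 0, 0, 0, 1], [1, 0, 1, 0, 0])
```

McNemar's test suits a paired comparison of two models on the same test cases because only the discordant cases carry information. Hamming distance suits the multi-label pattern output because it counts per-pattern disagreements instead of scoring a lesion as simply right or wrong.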
