Robustness and sex differences in skin cancer detection: logistic regression vs CNNs
Nikolette Pedersen, Regitze Sydendal, Andreas Wulff, Ralf Raumanns, Eike Petersen, Veronika Cheplygina
TL;DR
The paper investigates sex-related bias and robustness in skin lesion detection by comparing a logistic regression (LR) model trained on handcrafted features with a ResNet-50 CNN, across training sets of varied sex composition drawn from the PAD-UFES-20 dataset. It follows a replication-style design, evaluating both approaches with accuracy (ACC) and AUROC and testing for sex-based performance differences at a Bonferroni-corrected threshold of $\alpha=0.006$. Both models are broadly robust to changes in training sex composition, but the CNN shows a statistically significant advantage for male patients in AUROC (and in ACC in some cases), whereas the LR model shows no clear sex bias. The study highlights the continued relevance of handcrafted features for robustness, underscores the impact of dataset characteristics (including image modality and data quality), and emphasizes the need for careful bias assessment and model choice in clinical ML pipelines, with broader implications for reproducibility and fairness in medical imaging.
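To make the Bonferroni-corrected threshold concrete, the following is a minimal sketch of how such a threshold is typically computed: the family-wise significance level is divided by the number of tests. The function names are hypothetical, and the base level of 0.05 and the count of 8 comparisons are illustrative assumptions that are not stated in this summary; they are chosen only because 0.05 / 8 = 0.00625, which is consistent with the rounded value 0.006.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """Check a single p-value against the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Assumed values for illustration only: alpha = 0.05 and 8 comparisons.
ASSUMED_ALPHA = 0.05
ASSUMED_N_TESTS = 8
threshold = bonferroni_threshold(ASSUMED_ALPHA, ASSUMED_N_TESTS)
```

Under these assumptions, a p-value of 0.004 would count as significant, while a p-value of 0.01 would not, even though it falls below the uncorrected level of 0.05.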
Abstract
Deep learning has been reported to achieve high performance in the detection of skin cancer, yet many challenges regarding the reproducibility of results and biases remain. This study is a replication (different data, same analysis) of a previous study on Alzheimer's disease detection, which studied the robustness of logistic regression (LR) and convolutional neural networks (CNN) across patient sexes. We explore sex bias in skin cancer detection, using the PAD-UFES-20 dataset with LR trained on handcrafted features reflecting dermatological guidelines (ABCDE and the 7-point checklist), and a pre-trained ResNet-50 model. In line with the replicated study, we evaluate these models across multiple training datasets with varied sex composition to determine their robustness. Our results show that both the LR and the CNN were robust to the sex distribution of the training data, but they also reveal that the CNN had a significantly higher accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for male patients than for female patients. The data and relevant scripts to reproduce our results are publicly available (https://github.com/nikodice4/Skin-cancer-detection-sex-bias).
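The per-sex comparison of ACC and AUROC described above can be illustrated with a minimal sketch. This is not the authors' evaluation code (which is available in the linked repository); the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for illustration and do not correspond to any result from the study. AUROC is computed here as the fraction of positive/negative pairs that the scores rank correctly, with ties counted as half.

```python
from typing import Dict, Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC requires both classes to be present")
    # Count correctly ordered pairs; a tie contributes half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_by_group(y_true: Sequence[int], y_score: Sequence[float],
                     group: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Compute ACC and AUROC separately for each group label (e.g. sex)."""
    results = {}
    for g in sorted(set(group)):
        idx = [i for i, x in enumerate(group) if x == g]
        yt = [y_true[i] for i in idx]
        ys = [y_score[i] for i in idx]
        results[g] = {"acc": accuracy(yt, ys), "auroc": auroc(yt, ys)}
    return results


# Toy data, invented for illustration only (not from PAD-UFES-20).
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.1, 0.6, 0.7, 0.4, 0.3]
sex = ["M", "M", "M", "M", "F", "F", "F", "F"]
per_sex = metrics_by_group(y_true, y_score, sex)
```

In a robustness study of the kind described, a table like `per_sex` would be produced once per training set composition, and the male and female values would then be compared with a statistical test.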
