Skin Cancer Detection utilizing Deep Learning: Classification of Skin Lesion Images using a Vision Transformer
Carolin Flosdorf, Justin Engelker, Igor Keller, Nicolas Mohr
TL;DR
This work investigates skin cancer detection from dermatoscopic images using Vision Transformer (ViT) architectures. By fine-tuning two large pre-trained ViT configurations (ViT-L16 and ViT-L32) on the HAM10000 dataset with data augmentation, the authors achieve higher accuracy than baseline decision trees, KNN, CNNs, and smaller ViT variants: ViT-L16 reaches $92.79\%$ accuracy and ViT-L32 reaches $91.57\%$. Melanoma recall is considerably lower, at $58.54\%$ for ViT-L32 and $56.10\%$ for ViT-L16. The results demonstrate the promise of ViTs for fast, accurate skin cancer screening, while acknowledging limitations from data imbalance, low melanoma recall, interpretability concerns, and computational demands. Suggested future work includes balancing datasets, expanding to other skin diseases, and integrating ViT-based screening into clinical workflows or mobile tools.
Abstract
Skin cancer detection remains a major challenge in healthcare. Common detection methods can be lengthy and require human expertise, which is in short supply in many countries. Previous research demonstrates that convolutional neural networks (CNNs) can help, both by automating the process and by reaching an accuracy comparable to that of human experts. However, despite the progress of previous decades, precision is still limited, leading to substantial misclassifications that have a serious impact on people's health. We therefore employ the Vision Transformer (ViT), an architecture developed in recent years around the self-attention mechanism, in two pre-trained configurations. We generally find superior metrics for classifying skin lesions when comparing these models to baselines such as a decision tree classifier and a k-nearest neighbor (KNN) classifier, as well as to CNNs and less complex ViTs. In particular, we attach greater importance to performance on melanoma, the most lethal type of skin cancer. The ViT-L32 model achieves an accuracy of 91.57% and a melanoma recall of 58.54%, while ViT-L16 achieves an accuracy of 92.79% and a melanoma recall of 56.10%. This offers a potential tool for faster and more accurate diagnoses and an overall improvement for the healthcare sector.
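The gap between overall accuracy and melanoma recall reported above is typical of imbalanced data, where a rare class contributes little to accuracy. The following minimal sketch illustrates how the two metrics are computed and how they can diverge. The label lists are hypothetical toy data, not results from this study, and the helper functions `accuracy` and `recall` are illustrative names rather than the authors' code.

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data (NOT from the study): 90 nevi ("nv") and 10 melanomas ("mel").
y_true = ["nv"] * 90 + ["mel"] * 10
# Assumed predictions: every nevus is correct, but 4 of the 10 melanomas are missed.
y_pred = ["nv"] * 90 + ["mel"] * 6 + ["nv"] * 4

acc = accuracy(y_true, y_pred)              # 96 of 100 samples correct
mel_recall = recall(y_true, y_pred, "mel")  # 6 of 10 melanomas found

print(f"accuracy: {acc:.2%}, melanoma recall: {mel_recall:.2%}")
```

In this toy setting, missing four of ten melanomas costs only four percentage points of accuracy, which is why the class-specific recall is reported alongside accuracy.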
