Skin cancer diagnosis using NIR spectroscopy data of skin lesions in vivo using machine learning algorithms

Flavio P. Loss, Pedro H. da Cunha, Matheus B. Rocha, Madson Poltronieri Zanoni, Leandro M. de Lima, Isadora Tavares Nascimento, Isabella Rezende, Tania R. P. Canuto, Luciana de Paula Vieira, Renan Rossoni, Maria C. S. Santos, Patricia Lyra Frasson, Wanderson Romão, Paulo R. Filgueiras, Renato A. Krohling

TL;DR

This work addresses the need for non-invasive, rapid skin cancer triage using near-infrared spectroscopy by introducing the NIR-SC-UFES in vivo spectral dataset and a comprehensive ML evaluation. It compares extreme gradient boosting variants, a 1D-CNN, and standard chemometric methods (PLS-DA, SVM) across raw, SNV-preprocessed, and feature-extracted data, with SMOTE and GAN-based augmentation to mitigate class imbalance. A key finding is that LightGBM with SNV preprocessing, subsequence feature extraction, and GAN-based augmentation achieves the best balanced performance (BA $0.839$, recall $0.851$, precision $0.852$, F-score $0.850$), while SHAP analysis highlights informative spectral regions around $939.072$ nm to $994.821$ nm. The results demonstrate the potential of NIR spectral data for automated in vivo skin cancer triage and provide a publicly valuable dataset to spur further research.
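For readers unfamiliar with the headline metric, balanced accuracy (BA) is commonly defined as the mean of the per-class recalls, which keeps a majority class from dominating the score. The sketch below illustrates that general definition only; the function name and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (general definition, not the authors' code)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Indices of all samples whose true label is class c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced toy labels: 1 = cancer, 0 = non-cancer.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]

ba = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this toy example plain accuracy is higher than balanced accuracy, because the errors on the minority class weigh more once each class counts equally.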

Abstract

Skin lesions are classified as benign or malignant. Among the malignant ones, melanoma is a very aggressive cancer and the major cause of death, so early diagnosis of skin cancer is highly desirable. In the last few years, there has been growing interest in computer-aided diagnosis (CAD), mostly using image and clinical data of the lesion. These sources of information are limited because they cannot provide information about the molecular structure of the lesion. NIR spectroscopy may provide an alternative source of information for automated CAD of skin lesions. The techniques and classification algorithms most commonly used in spectroscopy are Principal Component Analysis (PCA), Partial Least Squares - Discriminant Analysis (PLS-DA), and Support Vector Machines (SVM). Nonetheless, there is growing interest in applying modern machine and deep learning (MDL) techniques to spectroscopy. One of the main obstacles to applying MDL to spectroscopy is the lack of public datasets. Since, as far as we know, there is no public dataset of NIR spectral data of skin lesions, a new dataset named NIR-SC-UFES has been collected, annotated, and analyzed, generating the gold standard for classification of NIR spectral data for skin cancer. Next, the machine learning algorithms XGBoost, CatBoost, LightGBM, and a 1D convolutional neural network (1D-CNN) were investigated to classify cancer and non-cancer skin lesions. Experimental results indicate that the best performance was obtained by LightGBM with standard normal variate (SNV) pre-processing and feature extraction, yielding 0.839 for balanced accuracy, 0.851 for recall, 0.852 for precision, and 0.850 for F-score. These results represent first steps toward CAD of skin lesions aimed at the automated triage of patients with skin lesions in vivo using NIR spectral data.
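As an illustration of the SNV pre-processing step named in the abstract, the sketch below applies the textbook standard normal variate transform: each spectrum is centred on its own mean and divided by its own standard deviation. It is a minimal sketch under stated assumptions, not the authors' pipeline. The function name `snv`, the use of the sample standard deviation (`ddof=1`), and the synthetic random spectra are all assumptions made for the example.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: scale each spectrum (row) independently."""
    spectra = np.asarray(spectra, dtype=float)
    # Per-spectrum statistics, kept as column vectors for broadcasting.
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


# Synthetic stand-in data: 4 spectra with 100 wavelength points each.
# The shape is arbitrary and does not reflect the NIR-SC-UFES dataset.
rng = np.random.default_rng(0)
demo_spectra = rng.normal(loc=0.5, scale=0.1, size=(4, 100))
demo_snv = snv(demo_spectra)
```

Because the statistics are computed per spectrum rather than per wavelength, SNV needs no fitting on training data and can be applied to each new measurement on its own.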

Paper Structure (26 sections, 7 equations, 6 figures, 12 tables, 1 algorithm)

This paper contains 26 sections, 7 equations, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: A photo sample of each type of skin lesion investigated in this work.
  • Figure 2: Acquisition of the NIR spectral data of a patient lesion using the MicroNIR portable spectrometer.
  • Figure 3: Spectral data samples of the six kinds of skin lesions contained in the NIR-SC-UFES dataset.
  • Figure 4: Structure of the GAN network for generating NIR samples. A noise vector $Z$ passes through the generator ($G$) obtaining the generated output, and the discriminator $D$ is simultaneously trained to distinguish the generated signals from the real ones. The reconstruction loss measures how close the generated signals are to the real ones.
  • Figure 5: Cross-validation performed using k-fold splits.
  • ...and 1 more figure