A Novel Perspective for Multi-modal Multi-label Skin Lesion Classification

Yuan Zhang, Yutong Xie, Hu Wang, Jodie C Avery, M Louise Hull, Gustavo Carneiro

TL;DR

This paper introduces SkinM2Former, a multi-modal, multi-label transformer-based skin lesion classifier. Its Tri-Modal Cross-attention Transformer (TMCT) fuses the three modalities (clinical images, dermoscopic images, and patient metadata) at several feature levels of a transformer encoder.

Abstract

The efficacy of deep learning-based Computer-Aided Diagnosis (CAD) methods for skin diseases relies on analyzing multiple data modalities (clinical images, dermoscopic images, and patient metadata) and on addressing the challenges of multi-label classification. Current approaches tend to rely on limited multi-modal techniques and treat the multi-label problem as a set of multi-class problems, overlooking issues related to imbalanced learning and multi-label correlation. This paper introduces a skin lesion classifier built on a Multi-modal Multi-label TransFormer-based model (SkinM2Former). For multi-modal analysis, we introduce the Tri-Modal Cross-attention Transformer (TMCT), which fuses the three modalities (two image types and metadata) at various feature levels of a transformer encoder. For multi-label classification, we introduce a multi-head attention (MHA) module to learn multi-label correlations, complemented by an optimisation that handles multi-label and imbalanced learning problems. SkinM2Former achieves a mean average accuracy of 77.27% and a mean diagnostic accuracy of 77.85% on the public Derm7pt dataset, outperforming state-of-the-art (SOTA) methods.
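
The cross-attention fusion idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' TMCT: it assumes a single attention head, no learned projections, and a routing in which each modality attends to the tokens of the other two. The function names (`cross_attention`, `fuse_three`) and the toy token values are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: token vectors of the modality being updated.
    context: token vectors of the other modalities, used as keys and values.
    Returns the query tokens plus the attended context (residual connection).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Similarity of this query token to every context token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        attended = [
            sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)
        ]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def fuse_three(clinical, dermoscopic, metadata):
    """Assumed routing: each modality attends to the other two combined."""
    return (
        cross_attention(clinical, dermoscopic + metadata),
        cross_attention(dermoscopic, clinical + metadata),
        cross_attention(metadata, clinical + dermoscopic),
    )


# Toy 2-dimensional tokens; the values are made up for illustration.
clinical = [[1.0, 0.0], [0.0, 1.0]]
dermoscopic = [[0.5, 0.5], [1.0, 1.0]]
metadata = [[0.2, 0.8]]

fused_clin, fused_derm, fused_meta = fuse_three(clinical, dermoscopic, metadata)
```

Each fused output keeps the token count and dimensionality of its own modality, so a block like this could in principle be applied repeatedly at several encoder levels. A real model would add learned query, key, and value projections, multiple heads, and normalisation.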

Paper Structure (18 sections, 8 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Multi-modal skin cancer classifier: (a) late fusion [kawahara2018seven]; (b) hybrid fusion of images and late fusion of metadata [tang2022fusionm4net, zhang2023tformer]; (c) our hybrid fusion of all modalities.
  • Figure 2: (a) Imbalanced distribution of samples per class. (b) Inter-label Pearson Correlation Coefficients heatmap. Note that labels are denoted by "Classification problem ({DIAG,PN,...})-Possible classes ({ABS,NEV,...})".
  • Figure 3: SkinM2Former: Tri-modal Cross-attention Transformer (TMCT) module to fuse all modalities; multi-head attention (MHA) layer to learn multi-label associations; and a multi-label loss [kobayashi2023two] that is robust to class imbalance.
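
The two quantities shown in Figure 2 (samples per class and inter-label Pearson correlation) can be computed along the following lines. This is a sketch under stated assumptions: the five samples are made up, `OTHER` is a placeholder class name, and the helper names (`one_hot_columns`, `pearson`) are hypothetical. Only the "problem-class" naming scheme (e.g. `DIAG-NEV`, `PN-ABS`) follows the caption.

```python
import math


def one_hot_columns(samples, label_space):
    """Build one binary indicator column per "problem-class" pair."""
    columns = {}
    for task, classes in label_space.items():
        for cls in classes:
            columns[f"{task}-{cls}"] = [
                1.0 if sample[task] == cls else 0.0 for sample in samples
            ]
    return columns


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a label never varies
    return cov / math.sqrt(var_x * var_y)


# Made-up toy annotations; "OTHER" stands in for all remaining classes.
label_space = {"DIAG": ["NEV", "OTHER"], "PN": ["ABS", "OTHER"]}
samples = [
    {"DIAG": "NEV", "PN": "ABS"},
    {"DIAG": "NEV", "PN": "ABS"},
    {"DIAG": "OTHER", "PN": "OTHER"},
    {"DIAG": "OTHER", "PN": "OTHER"},
    {"DIAG": "NEV", "PN": "OTHER"},
]

columns = one_hot_columns(samples, label_space)

# Samples per class, as in panel (a).
class_counts = {name: int(sum(col)) for name, col in columns.items()}

# Pairwise correlations, as in the heatmap of panel (b).
names = list(columns)
correlation = {
    (a, b): pearson(columns[a], columns[b]) for a in names for b in names
}
```

Strongly positive or negative off-diagonal entries in `correlation` indicate labels that tend to co-occur or exclude each other, which is the kind of structure a label-correlation module is meant to exploit.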