Skin Lesion Phenotyping via Nested Multi-modal Contrastive Learning

Dionysis Christopoulos, Sotiris Spanos, Eirini Baltzi, Valsamis Ntouskos, Konstantinos Karantzalos

TL;DR

This work tackles skin-lesion classification under substantial image variability by integrating lesion appearance with both lesion-level and patient-level metadata through a nested multi-modal contrastive pre-training framework (SLIMP). The method combines two InfoNCE objectives into a joint loss $L_{total}=\lambda L_{lesions}+(1-\lambda)L_{patient}$ to learn cohesive image-metadata embeddings, and augments this with continual pre-training and a retrieval-based metadata extrapolation strategy to adapt to new datasets. Empirical results across five datasets show that SLIMP, especially with full multi-modal features and metadata extrapolation, substantially improves downstream classification and retrieval tasks, with notable gains in low-shot settings. The approach reduces reliance on extensive labels, supports dataset adaptation, and offers clinically relevant, interpretable representations, though it requires structured metadata and the authors acknowledge limitations under domain shift.
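The joint loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a symmetric (CLIP-style) InfoNCE over cosine similarities, a temperature of 0.1, and precomputed embeddings. The function names (`info_nce`, `nested_loss`) and the `phenotype` argument, which stands for the per-patient aggregate of lesion representations, are hypothetical.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cosine_logits(a, b, temperature):
    # Pairwise cosine similarities between rows of a and b, divided by temperature.
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]


def _cross_entropy_diag(logits):
    # Mean of -log softmax(row_i)[i]: the matching pair sits on the diagonal.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(a, b, temperature=0.1):
    # Symmetric InfoNCE (an assumption): average of a->b and b->a directions.
    logits = _cosine_logits(a, b, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))


def nested_loss(img, lesion_meta, phenotype, patient_meta, lam=0.5, temperature=0.1):
    # L_total = lam * L_lesions + (1 - lam) * L_patient
    l_lesions = info_nce(img, lesion_meta, temperature)       # inner, lesion level
    l_patient = info_nce(phenotype, patient_meta, temperature)  # outer, patient level
    return lam * l_lesions + (1.0 - lam) * l_patient
```

With `lam=1.0` only the lesion-level term contributes, and with `lam=0.0` only the patient-level term does, so $\lambda$ trades off the inner and outer objectives.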

Abstract

We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and the lack of clinical and phenotypical context. Clinicians typically follow a holistic approach when assessing a patient's risk level and deciding which lesions may be malignant and need to be excised, considering the patient's medical history as well as the appearance of the patient's other lesions. Inspired by this, SLIMP combines the appearance and the metadata of individual skin lesions with patient-level metadata relating to their medical record and other clinically relevant information. By fully exploiting all available data modalities throughout the learning process, the proposed pre-training strategy improves performance over other pre-training strategies on downstream skin lesion classification tasks, highlighting the quality of the learned representations.

Paper Structure

This paper contains 44 sections, 2 equations, 9 figures, 22 tables.

Figures (9)

  • Figure 1: SLIMP architecture. An inner multi-modal contrastive loss is employed to maximize agreement among images of skin lesions and the corresponding metadata. Skin lesion image and metadata representations of a patient are aggregated, summarizing the lesion phenotype. At the patient level, agreement between the estimated lesion phenotype and the patient metadata is pursued through an outer contrastive loss.
  • Figure 2: Use of learned representations for skin lesion classification. Classification of a skin lesion using corresponding data modalities (image+metadata) is shown on the left. Classification of a skin lesion image using the retrieval-based metadata extrapolation method is shown on the right.
  • Figure 3: Cosine similarity distributions between image and metadata representations. On the SLICE-3D hold-out validation set, we report the similarity to the ground-truth metadata and to metadata retrieved from the training set using SLIMP. For HAM10000, PAD-UFES-20, and HIBA, the retrieved-metadata distributions are shown. Dashed vertical lines indicate the median similarity for each distribution.
  • Figure 4: Class distribution within each dataset considered.
  • Figure 5: Normalized feature importance scores for patient-level and lesion-level features. The importance scores are derived from the attention mechanism of the respective tabular transformer.
  • ...and 4 more figures
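
The retrieval-based metadata extrapolation mentioned in Figures 2 and 3 can be pictured roughly as follows. This is a sketch under stated assumptions rather than the paper's procedure: it assumes top-1 nearest-neighbour retrieval by cosine similarity over a bank of training-set metadata embeddings, and that the retrieved embedding is concatenated with the image embedding before classification. All function names are hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two vectors; 0.0 if either has zero norm.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_metadata(image_emb, train_meta_embs):
    # Return the index and similarity of the closest training metadata embedding.
    sims = [cosine(image_emb, m) for m in train_meta_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


def extrapolated_features(image_emb, train_meta_embs):
    # Assumed fusion: concatenate the image embedding with the retrieved metadata.
    idx, _ = retrieve_metadata(image_emb, train_meta_embs)
    return list(image_emb) + list(train_meta_embs[idx])
```

For instance, a query image embedding `[0.1, 0.9]` against the bank `[[1, 0], [0, 1], [-1, 0]]` retrieves index 1, since that metadata vector points in nearly the same direction. The similarity returned by `retrieve_metadata` is the kind of quantity whose distribution Figure 3 reports for retrieved metadata.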