Skin Lesion Phenotyping via Nested Multi-modal Contrastive Learning
Dionysis Christopoulos, Sotiris Spanos, Eirini Baltzi, Valsamis Ntouskos, Konstantinos Karantzalos
TL;DR
This work tackles skin-lesion classification under substantial image variability by integrating lesion appearance with both lesion-level and patient-level metadata through a nested multi-modal contrastive pre-training framework (SLIMP). The method combines two InfoNCE objectives into a joint loss $L_{total}=\lambda L_{lesions}+(1-\lambda)L_{patient}$ to learn cohesive image-metadata embeddings, and augments this with continual pre-training and a retrieval-based metadata extrapolation strategy to adapt to new datasets. Empirical results across five datasets show that SLIMP, especially with full multi-modal features and metadata extrapolation, substantially improves downstream classification and retrieval tasks, with notable gains in low-shot settings. The approach reduces reliance on extensive labels, supports dataset adaptation, and offers clinically relevant, interpretable representations, though it requires structured metadata and remains subject to domain-shift limitations.
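To make the joint objective concrete, here is a minimal sketch of how two InfoNCE terms could be weighted by $\lambda$. It is an illustration under stated assumptions, not the authors' implementation: the function names (`info_nce`, `total_loss`), the one-directional form of InfoNCE, the cosine similarity, the temperature value, and the choice of which embeddings are paired at each level are all hypothetical.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """One-directional InfoNCE: row i of `anchors` should match row i of `positives`.

    Assumption: cosine similarity and temperature 0.1 are illustrative choices.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]  # -log softmax of the matching pair
    return total / len(a)

def total_loss(lesion_pairs, patient_pairs, lam=0.5):
    """L_total = lam * L_lesions + (1 - lam) * L_patient.

    Assumption: each argument is an (anchors, positives) tuple of embeddings,
    e.g. (image, lesion metadata) and (patient, patient metadata).
    """
    l_lesions = info_nce(*lesion_pairs)
    l_patient = info_nce(*patient_pairs)
    return lam * l_lesions + (1.0 - lam) * l_patient
```

Setting `lam=1.0` reduces the objective to the lesion-level term alone and `lam=0.0` to the patient-level term alone, so $\lambda$ controls how much each level contributes to the shared embedding space.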
Abstract
We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and the lack of clinical and phenotypical context. Clinicians typically follow a holistic approach when assessing the patient's risk level and deciding which lesions may be malignant and need to be excised, considering the patient's medical history as well as the appearance of the patient's other lesions. Inspired by this, SLIMP combines the appearance and metadata of individual skin lesions with patient-level metadata relating to the patient's medical record and other clinically relevant information. By fully exploiting all available data modalities throughout the learning process, the proposed pre-training strategy improves performance over other pre-training strategies on downstream skin lesion classification tasks, highlighting the quality of the learned representations.
