Explainable Melanoma Diagnosis with Contrastive Learning and LLM-based Report Generation

Junwen Zheng, Xinran Xu, Li Rong Wang, Chang Cai, Lucinda Siyun Tan, Dingyuan Wang, Hong Liang Tey, Xiuyi Fan

TL;DR

This work tackles the interpretability gap in melanoma diagnosis by introducing CEFM, a cross-modal framework that aligns Vision Transformer-derived image features with the clinically defined ABC criteria (Asymmetry, Border, Color) and uses a CLIP-guided LLM to generate structured diagnostic reports. It combines a clinically grounded ABC feature extraction pipeline, a cross-modal contrastive alignment module, and a domain-adapted language model to produce explanations that are both interpretable and clinically relevant. Empirical results show strong classification performance (92.79% accuracy, AUC 0.961) and robust segmentation, with expert dermatologists endorsing the approach for interpretability and clinical utility. The work highlights the potential of integrating visual and textual modalities to bridge high performance with clinician trust, and outlines future directions such as incorporating differential structures and multi-temporal lesion tracking for broader applicability.

Abstract

Deep learning has demonstrated expert-level performance in melanoma classification, positioning it as a powerful tool in clinical dermatology. However, model opacity and the lack of interpretability remain critical barriers to clinical adoption, as clinicians often struggle to trust the decision-making processes of black-box models. To address this gap, we present a Cross-modal Explainable Framework for Melanoma (CEFM) that leverages contrastive learning as the core mechanism for achieving interpretability. Specifically, CEFM maps clinical criteria for melanoma diagnosis, namely Asymmetry, Border, and Color (ABC), into the Vision Transformer embedding space using dual projection heads, thereby aligning clinical semantics with visual features. The aligned representations are subsequently translated into structured textual explanations via natural language generation, creating a transparent link between raw image data and clinical interpretation. Experiments on public datasets demonstrate 92.79% accuracy and an AUC of 0.961, along with significant improvements across multiple interpretability metrics. Qualitative analyses further show that the spatial arrangement of the learned embeddings aligns with clinicians' application of the ABC rule, effectively bridging the gap between high-performance classification and clinical trust.
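
The alignment idea in the abstract (two projection heads mapping image features and ABC features into one space, trained contrastively) can be illustrated with a small sketch. This is not the authors' implementation: the linear form of the heads, the symmetric InfoNCE loss, the temperature value, and every name below (`project`, `embed`, `info_nce`, the toy weights and features) are assumptions chosen only to show the mechanism.

```python
import math

def project(x, weights):
    """Hypothetical linear projection head; weights is a list of rows (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

def embed(batch, weights):
    """Project each feature vector into the shared space and normalize it."""
    return [l2_normalize(project(x, weights)) for x in batch]

def info_nce(image_embs, clinical_embs, temperature=0.1):
    """Symmetric InfoNCE: the i-th image and i-th clinical vector form the positive pair."""
    n = len(image_embs)
    sims = [
        [sum(a * b for a, b in zip(image_embs[i], clinical_embs[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*sims)]  # clinical-to-image direction
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))

# Toy data (assumed): 4-d "ViT" features and 3-d "ABC" features, 2-d shared space.
W_IMG = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_ABC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
vit_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
abc_feats = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]

img = embed(vit_feats, W_IMG)
abc = embed(abc_feats, W_ABC)
aligned_loss = info_nce(img, abc)
swapped_loss = info_nce(img, abc[::-1])  # deliberately mismatched pairs
```

In this toy setup the loss for correctly matched pairs is lower than for the swapped pairs, which is the signal a contrastive objective uses to pull each lesion image toward its own clinical description.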

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Advantages of the Cross-modal Explainable Framework (CEFM) over existing frameworks.
  • Figure 2: We propose a cross-modal melanoma diagnosis framework integrating four interconnected pipelines: a ViT-based classification pipeline that extracts semantic image features, a clinical explanation pipeline that segments lesions and quantifies ABC criteria, a contrastive module that aligns visual and clinical representations for interpretability, and a report generation module that produces structured diagnostic reports using CLIP descriptors and a domain-adapted LLM (DeepSeek).
  • Figure 3: Overview of the cross-modal alignment process. Clinical features extracted from lesion segmentations are projected into a shared latent space, while image features from a ViT encoder are jointly aligned via element-wise interactions to supervise contrastive learning.
  • Figure 4: The figure shows cosine similarity distributions of positive and negative pairs after contrastive learning. Positive pairs cluster in the high-similarity region ($>0.75$), whereas negative pairs occupy lower values, indicating effective cross-modal alignment.
  • Figure 5: Comparison of the risk assessments generated by models with components removed. Text highlighted in orange comes from the clinical explanation component; text highlighted in green comes from CLIP with direct prompts; underlined text is generated by DeepSeek. With DeepSeek removed, the report is more fragmented.
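
The kind of check summarized in Figure 4 (positive pairs concentrating above a cosine similarity of 0.75, negative pairs sitting lower) can be sketched as follows. The embeddings here are made-up toy values, and the helper names (`cosine`, `split_similarities`, `fraction_above`) are hypothetical; only the 0.75 threshold comes from the caption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def split_similarities(image_embs, clinical_embs):
    """Matching indices are positive pairs; every other combination is a negative pair."""
    pos, neg = [], []
    for i, u in enumerate(image_embs):
        for j, v in enumerate(clinical_embs):
            (pos if i == j else neg).append(cosine(u, v))
    return pos, neg

def fraction_above(values, threshold=0.75):
    """Share of similarities strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values) if values else 0.0

# Toy embeddings (assumed), standing in for aligned image and clinical vectors.
image_embs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
clinical_embs = [[0.9, 0.2], [0.0, 1.0], [-0.8, 0.1]]

pos, neg = split_similarities(image_embs, clinical_embs)
pos_share = fraction_above(pos)
neg_share = fraction_above(neg)
```

For well-aligned embeddings, `pos_share` should be close to 1 and `neg_share` close to 0; a large overlap between the two groups would suggest the cross-modal alignment has not separated matched from mismatched pairs.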