Enhancing Skin Disease Diagnosis: Interpretable Visual Concept Discovery with SAM

Xin Hu, Janet Wang, Jihun Hamm, Rie R Yotsu, Zhengming Ding

TL;DR

This work tackles skin disease diagnosis from noisy clinical photos by leveraging the Segment Anything Model (SAM) to generate two-level masks and extract local visual concepts. A cross-attentive fusion framework then integrates these local concepts with global image features to improve predictive accuracy and reliability. Interpretability is achieved via CAM-based weak supervision and a MIL loss, enabling explanation with the most contributory visual concepts. Evaluations on MIND-the-SKIN and SCIN datasets show consistent improvements over baselines and demonstrate robust, interpretable performance in real-world, non-ideal imaging conditions, highlighting potential for enhanced teledermatology and remote clinical decision support.
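To make the MIL idea concrete, here is a minimal sketch of one common way to pool per-concept class scores into image-level logits: top-k mean pooling followed by cross-entropy. It is an illustration under stated assumptions, not the paper's implementation. The function names (`topk_mil_logits`, `cross_entropy`, `top_contributors`), the choice of mean pooling, and the toy scores are all hypothetical; the summary only states that CAM-based weak supervision and a MIL loss are used.

```python
import math

def topk_mil_logits(cam, k):
    """Pool instance-level scores into image-level logits (assumed pooling rule).

    cam: list of rows, one per visual concept (instance);
         each row holds one score per class.
    Returns, for each class, the mean of its k highest instance scores.
    """
    num_classes = len(cam[0])
    k = min(k, len(cam))  # guard against k larger than the number of concepts
    logits = []
    for c in range(num_classes):
        scores = sorted((row[c] for row in cam), reverse=True)
        logits.append(sum(scores[:k]) / k)
    return logits

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single image."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def top_contributors(cam, cls, k):
    """Indices of the k concepts with the highest score for class `cls`."""
    order = sorted(range(len(cam)), key=lambda i: cam[i][cls], reverse=True)
    return order[:k]

# Toy example: 4 visual concepts, 2 classes (values are made up).
cam = [[2.0, 0.1], [0.5, 0.3], [1.0, 0.2], [-1.0, 0.0]]
logits = topk_mil_logits(cam, k=2)
loss = cross_entropy(logits, label=0)
explanation = top_contributors(cam, cls=0, k=2)
```

In this sketch the same ranking that produces the image-level logit also yields the explanation: the concepts returned by `top_contributors` are the ones that drove the prediction for the chosen class.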

Abstract

Current AI-assisted skin image diagnosis has achieved dermatologist-level performance in classifying skin cancer, driven by rapid advancements in deep learning architectures. However, unlike traditional vision tasks, skin images in general present unique challenges due to the limited availability of well-annotated datasets, complex variations in conditions, and the necessity for detailed interpretations to ensure patient safety. Previous segmentation methods have sought to reduce image noise and enhance diagnostic performance, but these techniques require fine-grained, pixel-level ground truth masks for training. In contrast, with the rise of foundation models, the Segment Anything Model (SAM) has been introduced to facilitate promptable segmentation, enabling the automation of the segmentation process with simple yet effective prompts. Efforts applying SAM predominantly focus on dermatoscopy images, which present more easily identifiable lesion boundaries than clinical photos taken with smartphones. This limitation constrains the practicality of these approaches to real-world applications. To overcome the challenges posed by noisy clinical photos acquired via non-standardized protocols and to improve diagnostic accessibility, we propose a novel Cross-Attentive Fusion framework for interpretable skin lesion diagnosis. Our method leverages SAM to generate visual concepts for skin diseases using prompts, integrating local visual concepts with global image features to enhance model performance. Extensive evaluation on two skin disease datasets demonstrates our proposed method's effectiveness on lesion diagnosis and interpretability.

Paper Structure (16 sections, 3 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Generated masks on two skin samples with the prompts "lesion" and "leg". The first row shows a "Buruli ulcer" image, in which the lesion is clearly detected as the visual concept. The second row shows a "Mycetoma" image, in which the lesion boundary is ambiguous and hard to recognize.
  • Figure 2: Overall framework of the proposed model. Grounding-DINO and SAM extract visual concepts, bounding boxes, and segmentation masks. The local encoder $\mathcal{F}_l(\cdot)$ converts the visual concepts into local tokens, which serve as "query" prompts that highlight the salient areas of the encoded global image through the cross-attentive module. The classifier $\mathcal{F}_c(\cdot)$ maps the latent features to a CAM used for classification and for interpreting the decision-making process.
  • Figure 3: Confusion matrices for different top-$k$ strategies on the MIND-the-SKIN dataset: (a) top-1; (b) top-5; (c) top-15.
  • Figure 4: Performance comparison between the baseline (global) and fusion (local + global) methods on the MIND-the-SKIN dataset. (a) Per-condition confidence of the baseline method; (b) class-wise confidence of our fusion method; (c) a "Scabies" example that the baseline method misclassifies and our proposed fusion model predicts correctly.
  • Figure 5: Visualization of data separability for the NTD (left) and SCIN (right) datasets after feature extraction by ViT.
  • ...and 1 more figures
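
The cross-attentive step described in the Figure 2 caption, where local concept tokens act as queries over the encoded global image, can be sketched as plain scaled dot-product attention. This is a simplified illustration, assuming a single head and no learned query/key/value projections; the function names (`softmax`, `cross_attend`) and the toy vectors are hypothetical, and the actual model presumably uses learned encoders and projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(local_tokens, global_tokens):
    """Single-head cross-attention without learned projections (assumption).

    local_tokens:  queries, one d-dim vector per visual concept.
    global_tokens: keys and values, one d-dim vector per image patch.
    Returns (fused tokens, attention weights), one row per local token.
    """
    d = len(local_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in local_tokens:
        # Similarity of this concept to every global patch token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key))
                  for key in global_tokens]
        attn = softmax(scores)
        # Weighted sum of global tokens: the patches this concept "triggers".
        out = [sum(a * v[j] for a, v in zip(attn, global_tokens))
               for j in range(d)]
        fused.append(out)
        weights.append(attn)
    return fused, weights

# Toy example: 2 concept tokens attending over 3 patch tokens (made-up values).
local_tokens = [[1.0, 0.0], [0.0, 1.0]]
global_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused, weights = cross_attend(local_tokens, global_tokens)
```

Each row of `weights` shows which global patches a given concept attends to, and the fused tokens are what a downstream classifier such as $\mathcal{F}_c(\cdot)$ would consume in this simplified picture.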