Towards Concept-based Interpretability of Skin Lesion Diagnosis using Vision-Language Models

Cristiano Patrício, Luís F. Teixeira, João C. Neves

TL;DR

This work addresses the interpretability gap in skin lesion diagnosis by using vision-language models to reduce the need for extensive concept annotations. It introduces an embedding-learning approach that adapts CLIP with learnable image and text projections $W_I$ and $W_T$, aligning the projected features of same-disease image-text pairs via cosine similarity $S_c$. The authors compare three classification strategies (Baseline, CBM, and GPT+CBM) and show that textual embeddings built from expert dermoscopic concepts yield stronger performance and interpretable explanations, often surpassing the original CLIP and approaching MONET with less training. The approach reduces the annotation burden, provides explanations expressed in dermoscopic concepts, and demonstrates data-efficient melanoma detection across multiple dermoscopic datasets, with potential applicability to other imaging modalities.

Abstract

Concept-based models naturally lend themselves to the development of inherently interpretable skin lesion diagnosis, as medical experts make decisions based on a set of visual patterns of the lesion. Nevertheless, the development of these models depends on the existence of concept-annotated datasets, whose availability is scarce due to the specialized knowledge and expertise required in the annotation process. In this work, we show that vision-language models can be used to alleviate the dependence on a large number of concept-annotated samples. In particular, we propose an embedding learning strategy to adapt CLIP to the downstream task of skin lesion classification using concept-based descriptions as textual embeddings. Our experiments reveal that vision-language models not only attain better accuracy when using concepts as textual embeddings, but also require a smaller number of concept-annotated samples to attain comparable performance to approaches specifically devised for automatic concept generation.
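The embedding-learning idea summarized above (project the image and text features with $W_I$ and $W_T$, then score the pair by cosine similarity $S_c$) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy feature vectors, and the fixed projection matrices are all hypothetical, and in practice the features would come from CLIP's encoders and the projections would be learned by training.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine_similarity(a, b):
    """Cosine similarity S_c between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def projected_similarity(W_I, W_T, image_feat, text_feat):
    """Project both modalities into the shared space, then compare them."""
    return cosine_similarity(matvec(W_I, image_feat), matvec(W_T, text_feat))


# Toy values (assumptions for illustration only, not CLIP outputs).
W_I = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]          # hypothetical 2x3 image projection
W_T = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 0.0]]          # hypothetical 2x3 text projection
image_feat = [3.0, 4.0, 5.0]     # stand-in for a frozen image feature
text_feat = [4.0, 3.0, 7.0]      # stand-in for a frozen concept-text feature

s_c = projected_similarity(W_I, W_T, image_feat, text_feat)
print(f"S_c = {s_c:.3f}")
```

In this toy case both projections map their inputs to the same 2-D vector, so the pair is perfectly aligned; a training objective of the kind described would push $S_c$ towards this value for image-text pairs of the same disease.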

Paper Structure (10 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: The workflow of our proposed strategy. After learning the new multi-modal embedding space (left), we predict the presence of melanoma by linearly combining the similarity scores with the melanoma coefficients acting as the bottleneck layer of CBM. The result of this operation is then compared with a threshold value to predict the presence or absence of melanoma.
  • Figure 2: Evaluation results (in BACC %) of the different classification strategies (Baseline, CBM and GPT+CBM) on three datasets (PH$^2$, Derm7pt and ISIC 2018) for melanoma detection. Black-box linear probing performance is marked with $\bigstar$.
  • Figure 3: Computational performance analysis of our proposed embedding learning procedure.
  • Figure 4: Examples of dermoscopic images classified based on dermoscopic concepts.
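
The decision rule described in the Figure 1 caption (linearly combine the image-concept similarity scores with the melanoma coefficients, then compare the result with a threshold) can be sketched as follows. The concept names, coefficient values, threshold, and the choice of `>=` for the comparison are assumptions made for illustration; the source does not give these specifics.

```python
def melanoma_score(similarities, coefficients):
    """Linear combination of concept similarity scores (the bottleneck layer)."""
    return sum(coefficients[c] * similarities[c] for c in coefficients)


def predict_melanoma(similarities, coefficients, threshold):
    """Return (score, decision); the >= comparison is an assumed convention."""
    score = melanoma_score(similarities, coefficients)
    return score, score >= threshold


def explain(similarities, coefficients):
    """Rank concepts by the magnitude of their contribution to the score."""
    contributions = {c: coefficients[c] * similarities[c] for c in coefficients}
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical image-concept similarity scores for one dermoscopic image.
similarities = {
    "atypical pigment network": 0.5,
    "regular streaks": 0.25,
    "blue-whitish veil": 0.75,
}
# Hypothetical melanoma coefficients (positive values support melanoma).
coefficients = {
    "atypical pigment network": 2.0,
    "regular streaks": -1.0,
    "blue-whitish veil": 1.0,
}

score, is_melanoma = predict_melanoma(similarities, coefficients, threshold=1.0)
ranking = explain(similarities, coefficients)
print(score, is_melanoma)
for concept, contribution in ranking:
    print(f"{concept}: {contribution:+.2f}")
```

Because the score is a plain weighted sum, each concept's contribution can be read off directly, which is what makes this kind of prediction explainable in terms of dermoscopic concepts.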