Towards Concept-based Interpretability of Skin Lesion Diagnosis using Vision-Language Models
Cristiano Patrício, Luís F. Teixeira, João C. Neves
TL;DR
This work addresses the interpretability gap in skin lesion diagnosis by using vision-language models to reduce the need for extensive concept annotations. It introduces an embedding learning approach that adapts CLIP with learnable image and text projections $W_I$ and $W_T$, aligning image and text features of the same disease via cosine similarity $S_c$. The authors compare Baseline, CBM, and GPT+CBM strategies and show that textual embeddings built from expert dermoscopic concepts yield stronger performance and interpretable explanations, often surpassing the original CLIP and approaching MONET with less training. The approach reduces the annotation burden, grounds its explanations in dermoscopic concepts, and demonstrates data-efficient melanoma detection across multiple dermoscopic datasets, with potential applicability to other imaging modalities.
Abstract
Concept-based models naturally lend themselves to the development of inherently interpretable skin lesion diagnosis, as medical experts make decisions based on a set of visual patterns of the lesion. Nevertheless, the development of these models depends on the existence of concept-annotated datasets, whose availability is scarce due to the specialized knowledge and expertise required in the annotation process. In this work, we show that vision-language models can be used to alleviate the dependence on a large number of concept-annotated samples. In particular, we propose an embedding learning strategy to adapt CLIP to the downstream task of skin lesion classification using concept-based descriptions as textual embeddings. Our experiments reveal that vision-language models not only attain better accuracy when using concepts as textual embeddings, but also require a smaller number of concept-annotated samples to attain comparable performance to approaches specifically devised for automatic concept generation.
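
To make the embedding learning idea concrete, the following is a minimal NumPy sketch of the scoring step only: pre-extracted image and text features are passed through learnable linear projections $W_I$ and $W_T$, L2-normalized, and compared by cosine similarity $S_c$. It is an illustration under stated assumptions, not the authors' implementation. The function names, the feature dimensions, the use of plain linear projections, and the random arrays standing in for frozen CLIP features are all hypothetical, and the training objective that updates $W_I$ and $W_T$ is omitted.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector along `axis` to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cosine_similarity_matrix(image_feats, text_feats, W_I, W_T):
    """Return S_c, the pairwise cosine similarity of projected features.

    image_feats: (n_images, d_img) features, assumed to come from a frozen image encoder
    text_feats:  (n_texts, d_txt) features, assumed to come from a frozen text encoder
    W_I:         (d_img, d_proj) image projection (hypothetical linear form)
    W_T:         (d_txt, d_proj) text projection (hypothetical linear form)
    """
    z_img = l2_normalize(image_feats @ W_I)
    z_txt = l2_normalize(text_feats @ W_T)
    # Dot products of unit vectors are cosine similarities.
    return z_img @ z_txt.T


def predict(image_feats, class_text_feats, W_I, W_T):
    """Assign each image to the class whose text embedding is most similar."""
    S_c = cosine_similarity_matrix(image_feats, class_text_feats, W_I, W_T)
    return S_c.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_img, d_txt, d_proj = 8, 8, 4          # arbitrary illustrative sizes
    images = rng.normal(size=(5, d_img))    # placeholder for image features
    texts = rng.normal(size=(2, d_txt))     # placeholder for two class descriptions
    W_I = rng.normal(size=(d_img, d_proj))  # untrained, randomly initialized
    W_T = rng.normal(size=(d_txt, d_proj))
    print(predict(images, texts, W_I, W_T))
```

In a concept-based variant, the rows of `class_text_feats` would hold embeddings of concept descriptions rather than bare class names, so the per-concept similarities could also serve as the explanation. That mapping is likewise an assumption of this sketch.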
