SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis

Zhixiang Lu; Shijie Xu; Kaicheng Yan; Xuyue Cai; Chong Zhang; Yulong Li; Angelos Stefanidis; Anh Nguyen; Jionglong Su

Abstract

The deployment of vision-language models (VLMs) in dermatology is hindered by a trilemma: high computational cost, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a "frozen perception, adaptive reasoning" paradigm, we pair a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL decoder adapted via low-rank adaptation (LoRA). To align visual regions with clinical semantics under long-tailed class distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss, an objective that combines focal re-weighting, semantic alignment, and calibration. On the ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Blinded expert evaluation and out-of-distribution testing further confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
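The abstract names the three ingredients of the CFA Loss (focal re-weighting, semantic alignment, and calibration) without giving their form here. The sketch below is only an illustration of how such a composite objective could be assembled for a single sample. It assumes the standard focal loss, a cosine-distance alignment term, and a Brier-style calibration term; the function names and the weights `lam_align` and `lam_cal` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_term(probs, target, gamma=2.0):
    """Standard focal loss: down-weights samples the model already gets right."""
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def alignment_term(img_vec, txt_vec):
    """Assumed alignment penalty: 1 - cosine similarity of image and text features."""
    dot = sum(a * b for a, b in zip(img_vec, txt_vec))
    norm = math.sqrt(sum(a * a for a in img_vec)) * math.sqrt(sum(b * b for b in txt_vec))
    return 1.0 - dot / norm


def calibration_term(probs, target):
    """Assumed calibration penalty: Brier score against the one-hot label."""
    return sum((p - (1.0 if k == target else 0.0)) ** 2 for k, p in enumerate(probs))


def cfa_sketch(logits, target, img_vec, txt_vec, gamma=2.0, lam_align=1.0, lam_cal=0.1):
    """Hypothetical weighted sum of the three terms; the weights are illustrative."""
    probs = softmax(logits)
    return (
        focal_term(probs, target, gamma)
        + lam_align * alignment_term(img_vec, txt_vec)
        + lam_cal * calibration_term(probs, target)
    )
```

With `gamma=0` the focal term reduces to ordinary cross-entropy, so `gamma` controls how strongly easy examples are suppressed relative to rare, hard ones, which is the usual motivation for focal re-weighting on long-tailed data.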

Paper Structure (25 sections, 9 equations, 4 figures, 4 tables)

Introduction
Related Work
Methodology
Efficient Multimodal Architecture
Frozen Visual Perception ($E_{img}$)
Multimedia Data Bridging
Focal Pooling & Feature Projection
LoRA-Adapted Generative Decoder ($E_{dec}$)
Consistency-Aware Focal Alignment (CFA) Loss
Imbalance-Resilient Classification ($\mathcal{L}_{focal}$)
Visual-Semantic Alignment ($\mathcal{L}_{align}$)
Theoretical Justification for Implicit Grounding
Calibration Regularization ($\mathcal{L}_{cal}$)
Generative Reasoning ($\mathcal{L}_{gen}$)
Experiments
...and 10 more sections

Figures (4)

Figure 1: Consistency-Aware Focal Alignment Architecture.
Figure 2: The overall framework of SkinCLIP-VL. The architecture consists of three key stages: (1) Meta-Data Enhancement: We leverage GPT-4o to expand tabular meta-data into comprehensive clinical descriptions, providing semantic guidance for the visual branch. (2) Parameter-Efficient Encoding: Instead of full fine-tuning, we employ a frozen CLIP visual encoder and a LoRA-adapted Qwen2.5-VL generative decoder to extract and align visual and textual features. (3) Consistency-Aware Focal Alignment (CFA): To capture fine-grained correlations, the Focal Pooling Layer aggregates the $N \times M$ interaction map into compact focal vectors. These vectors are fused via a Transformer Encoder under the joint supervision of focal and alignment losses ($\mathcal{L}_{focal}, \mathcal{L}_{align}$).
Figure 3: Hyperparameter sensitivity analysis.
Figure 4: Case study of dynamic visual grounding.
