Deep Modeling and Optimization of Medical Image Classification
Yihang Wu, Muhammad Owais, Reem Kateb, Ahmad Chaddad
TL;DR
The paper tackles privacy and data-efficiency challenges in medical image classification by proposing a CLIP-based multimodal framework that combines CNN and ViT image encoders with a CLIP text encoder. It investigates two federated learning (FL) approaches (FedAVG and FedProx) to protect data privacy while evaluating a comprehensive set of 12 backbones and three image encoders for skin and brain cancer tasks, supplemented by traditional ML classifiers trained on deep features. Key findings show that maxvit_t achieves an averaged test metric (AVG) of 87.03% on HAM10000 with multimodal training, convnext_l performs strongly in federated settings, and SVM/KNN can substantially improve generalization on unseen domains. The work demonstrates the practicality of combining multimodal CLIP models with FL and classical ML support to enhance robustness in privacy-constrained medical imaging, with future directions including domain adaptation to mitigate cross-dataset distribution shifts.
Abstract
Deep models, such as convolutional neural networks (CNNs) and vision transformers (ViTs), demonstrate remarkable performance in image classification. However, these models require large amounts of data to fine-tune, which is often impractical in the medical domain because of data privacy constraints. Furthermore, despite the promising performance of contrastive language-image pre-training (CLIP) in the natural image domain, its potential has not been fully investigated in the medical field. To address these challenges, we consider three scenarios: 1) we introduce a novel CLIP variant that uses four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine the 12 deep models with two federated learning (FL) techniques to protect data privacy, and 3) we apply traditional machine learning (ML) methods to improve the generalization ability of these deep models on unseen domain data. The experimental results indicate that maxvit achieves the highest averaged (AVG) test metrics (AVG = 87.03\%) on the HAM10000 dataset with multimodal learning, while convnext\_l achieves a test F1-score of 83.98\% in the FL setting, compared with 81.33\% for swin\_b. Furthermore, the use of a support vector machine (SVM) improves the overall test metrics of the swin transformer series on ISIC2018 by $\sim 2\%$ in AVG. Our code is available at https://github.com/AIPMLab/SkinCancerSimulation.
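
For readers unfamiliar with the two FL techniques named above, the following is a minimal sketch of their core operations, written under stated assumptions and not taken from the authors' repository. It assumes the standard formulations: FedAvg aggregates client parameters as an average weighted by local sample counts, and FedProx adds a proximal term $(\mu/2)\lVert w - w_{\text{global}}\rVert^2$ to each client's local loss. The names `fedavg`, `fedprox_penalty`, and `Weights`, and the use of plain Python lists as parameter vectors, are hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def fedavg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


def fedprox_penalty(local: Weights, global_weights: Weights, mu: float) -> float:
    """Proximal term (mu / 2) * ||w_local - w_global||^2 added to the local loss."""
    squared_distance = sum(
        (a - b) ** 2
        for name in local
        for a, b in zip(local[name], global_weights[name])
    )
    return 0.5 * mu * squared_distance


if __name__ == "__main__":
    # Two toy clients; the second holds three times as many samples.
    clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    sizes = [1, 3]
    global_model = fedavg(clients, sizes)
    print(global_model)
    print(fedprox_penalty(clients[0], global_model, mu=0.1))
```

In this sketch only parameters and sample counts leave a client, which is the sense in which FL protects the raw images. With `mu = 0` the FedProx penalty vanishes and local training reduces to the FedAvg case.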
