Deep Modeling and Optimization of Medical Image Classification

Yihang Wu, Muhammad Owais, Reem Kateb, Ahmad Chaddad

TL;DR

The paper tackles privacy and data-efficiency challenges in medical image classification by proposing a CLIP-based multimodal framework that pairs CNN and ViT image encoders with a CLIP text encoder. It investigates two federated learning (FL) approaches, FedAVG and FedProx, to protect data privacy, and evaluates 12 backbones with text encoders taken from three pretrained CLIP models on skin and brain cancer tasks, supplemented by traditional ML classifiers trained on deep features. Key findings: maxvit_t achieves an average performance of 87.03% on HAM10000 with multimodal training, convnext_l performs strongly in federated settings (F1-score of 83.98%), and SVM/KNN can improve generalization on unseen domains (about 2% on average for the swin transformer series on ISIC2018 with SVM). The work demonstrates the practicality of combining multimodal CLIP models with FL and classical ML to improve robustness in privacy-constrained medical imaging; future directions include domain adaptation to mitigate cross-dataset distribution shifts.

Abstract

Deep models, such as convolutional neural networks (CNNs) and vision transformers (ViTs), demonstrate remarkable performance in image classification. However, these models require large amounts of data for fine-tuning, which is often impractical in the medical domain because of data privacy constraints. Furthermore, despite the promising performance of contrastive language-image pre-training (CLIP) in the natural image domain, its potential has not been fully investigated in the medical field. To address these challenges, we consider three scenarios: 1) we introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine 12 deep models with two federated learning (FL) techniques to protect data privacy, and 3) we apply traditional machine learning (ML) methods to improve the generalization ability of those deep models on unseen domain data. The experimental results indicate that maxvit achieves the highest averaged (AVG) test metrics (AVG = 87.03%) on the HAM10000 dataset with multimodal learning, while in the FL setting convnext_l reaches a test F1-score of 83.98%, compared with 81.33% for swin_b. Furthermore, the use of a support vector machine (SVM) can improve the averaged test metrics by about 2% for the swin transformer series on ISIC2018. Our code is available at https://github.com/AIPMLab/SkinCancerSimulation.
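To make the federated learning part more concrete, here is a minimal sketch of the two aggregation ideas named in the TL;DR, FedAVG and FedProx, in their commonly described form. It is not taken from the authors' repository: the function names `fedavg` and `fedprox_grad`, the dictionary-of-lists parameter layout, and the value of `mu` are assumptions made for illustration only. FedAVG averages client parameters weighted by local sample count; FedProx adds a proximal term (mu/2)·||w − w_global||² to each client's local loss, which contributes mu·(w − w_global) to the local gradient.

```python
from typing import Dict, List

# Hypothetical parameter layout: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighted by each client's sample count."""
    total = sum(client_sizes)
    first = client_params[0]
    return {
        name: [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(len(first[name]))
        ]
        for name in first
    }


def fedprox_grad(
    grad: List[float], w: List[float], w_global: List[float], mu: float
) -> List[float]:
    """Add the gradient of the proximal term (mu/2)*||w - w_global||^2
    to a client's local gradient, pulling w toward the global model."""
    return [g + mu * (wi - wg) for g, wi, wg in zip(grad, w, w_global)]


# Toy example: two clients holding 1 and 3 samples respectively.
global_params = fedavg(
    [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}],
    [1, 3],
)
```

In this toy setup the second client holds three times as much data, so the aggregated weights sit closer to its parameters. In FedProx the server-side aggregation is typically the same weighted average; only the client-side update changes.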

Paper Structure

This paper contains 9 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Flowchart of the proposed framework. 1) Data acquisition: image data are preprocessed. 2) Proposed models: three key components, namely multimodal learning, federated learning, and the combination of traditional ML with deep learning models. 3) Evaluation: the models' performance is evaluated using standard classification metrics.
  • Figure 2: Test metrics on ISIC2018 dataset using deep models (Left), deep models with KNN (Middle) and deep models with SVM (Right).
  • Figure 3: Spider plot of multimodal test metrics (%) for 12 deep models on HAM10000 and BraTS2019, using text encoders taken from three pretrained CLIP models (ViT-L/14, ResNet50x16 and ResNet50x64).
  • Figure 4: Spider plot of global test metrics (%) for 12 deep models on HAM10000 using FedAVG (first and third row) and FedProx (second and third row).
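The "deep models with KNN" setting of Figure 2 can be illustrated with a small sketch: features are extracted from a trained deep model and a classical classifier is fit on top of them. The code below is a stdlib-only k-nearest-neighbours classifier over toy 2-D vectors that stand in for deep features. The function name `knn_predict`, the feature values, the class labels, and the choice of k = 3 are all hypothetical; the paper's actual feature extraction and classifier settings are not given in this summary.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(
    train_feats: Sequence[Sequence[float]],
    train_labels: Sequence[str],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Return the majority label among the k training features
    closest to `query` in Euclidean distance."""
    neighbours = sorted(
        (math.dist(feat, query), label)
        for feat, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for deep features (assumed values, not real model outputs).
train_feats: List[List[float]] = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],   # cluster for class "nevus"
    [1.0, 0.9], [0.9, 1.1], [1.1, 1.0],   # cluster for class "melanoma"
]
train_labels = ["nevus"] * 3 + ["melanoma"] * 3

pred_a = knn_predict(train_feats, train_labels, [0.95, 1.0])
pred_b = knn_predict(train_feats, train_labels, [0.05, 0.1])
```

In practice the feature vectors would come from a backbone's penultimate layer and a library implementation (for example an SVM or KNN from scikit-learn) would replace this hand-written classifier.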