DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation

Francisco Filho, Kelvin Cunha, Fábio Papais, Emanoel dos Santos, Rodrigo Mota, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

TL;DR

The paper addresses severe class imbalance in skin lesion datasets by generating synthetic images with class-conditioned latent diffusion to balance the classes. It then applies masked autoencoder (MAE) self-supervised pretraining on real and synthetic data so that a large Vision Transformer (ViT-H) learns robust, domain-relevant features, which are transferred to a compact student model through soft-target knowledge distillation. Experiments on HAM10000 show that MAE pretraining with synthetic data consistently improves classification performance, with further gains from distillation into compact architectures capable of on-device deployment. By combining synthetic data generation, self-supervised learning, and distillation, the approach targets accurate dermatology classification in mobile settings despite data scarcity and computational constraints.

Abstract

Skin lesion classification datasets often suffer from severe class imbalance, with malignant cases significantly underrepresented, leading to biased decision boundaries during deep learning training. We address this challenge using class-conditioned diffusion models to generate synthetic dermatological images, followed by self-supervised MAE pretraining to enable huge ViT models to learn robust, domain-relevant features. To support deployment in practical clinical settings, where lightweight models are required, we apply knowledge distillation to transfer these representations to a smaller ViT student suitable for mobile devices. Our results show that MAE pretraining on synthetic data, combined with distillation, improves classification performance while enabling efficient on-device inference for practical clinical use.
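
As a rough illustration of the soft-target distillation step described above, the sketch below computes a standard temperature-scaled distillation loss for a single example. It is not the authors' implementation: the function names, the temperature of 2.0, and the weighting factor alpha are assumptions chosen for illustration, and the loss form (cross-entropy plus temperature-scaled KL divergence) is the common Hinton-style formulation rather than one confirmed by the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical soft-target distillation loss for one example.

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence between softened teacher and student distributions.
    Both hyperparameters are illustrative assumptions. Extreme logits
    could underflow inside the logarithms; this sketch does not guard
    against that.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    # Cross-entropy against the ground-truth label at temperature 1
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-target gradient scale comparable
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# Binary benign/malignant example with made-up logits
loss = distillation_loss([0.2, 1.1], [0.1, 2.3], label=1)
```

When the student matches the teacher exactly, the KL term vanishes and only the hard-label term remains, which is one quick sanity check on such a loss.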

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of the proposed framework. First, class-conditioned latent diffusion generates synthetic skin images. Next, the synthetic data is used to pretrain a ViT-H model with an MAE objective. Finally, knowledge distillation transfers the pretrained ViT-H's representations to a smaller student model, which is fine-tuned on a combination of real and synthetic data.
  • Figure 2: Qualitative comparison of latent diffusion generation strategies. Rows (top to bottom): MSE loss, MSE + perceptual loss (unconditional), and MSE + perceptual loss (class-conditioned benign/malignant).
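
The MAE pretraining stage in Figure 1 relies on hiding a random subset of image patches and reconstructing them. The sketch below illustrates that idea only. The 75% mask ratio is the default from the original MAE work and the 196-patch grid (14 x 14) is a hypothetical choice; neither is confirmed as this paper's setting, and the helper names are invented for illustration.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets.

    mask_ratio=0.75 is an assumed value, not the paper's stated setting.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])  # patches the encoder sees
    masked = sorted(order[num_keep:])   # patches the decoder reconstructs
    return visible, masked


def masked_mse(pred_patches, target_patches, masked):
    """Mean squared error computed over masked patches only."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred_patches[i], target_patches[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical 14 x 14 patch grid
visible, masked = random_patch_mask(196, seed=0)
```

Restricting the reconstruction loss to masked patches means the encoder cannot reduce it by copying visible input, so it has to infer the hidden content from context.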