Achieving Reliable and Fair Skin Lesion Diagnosis via Unsupervised Domain Adaptation

Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm

TL;DR

This paper addresses the challenge of building reliable and fair skin lesion diagnostic systems when target labels are scarce. It evaluates unsupervised domain adaptation (UDA) under single-source, combined-source, and multi-source schemes to integrate six public datasets, using ROI preprocessing and a VGG16-BN feature extractor. The key finding is that multi-source UDA consistently outperforms single-source and non-DA baselines on both binary and multi-class tasks, and that it reduces disparities across sensitive groups without explicit fairness interventions. The study also finds a strong association between label shift and test error in multi-class settings. The authors tentatively attribute the fairness gains to the larger and better-adapted pool of demographic information that multiple sources provide. Collectively, the results highlight the practical value of leveraging diverse public data through UDA to achieve reliable and equitable dermatology AI in data-scarce scenarios.

Abstract

The development of reliable and fair diagnostic systems is often constrained by the scarcity of labeled data. To address this challenge, our work explores the feasibility of unsupervised domain adaptation (UDA) to integrate large external datasets for developing reliable classifiers. The adoption of UDA with multiple sources can simultaneously enrich the training set and bridge the domain gap between different skin lesion datasets, which vary due to distinct acquisition protocols. Particularly, UDA shows practical promise for improving diagnostic reliability when training with a custom skin lesion dataset, where only limited labeled data are available from the target domain. In this study, we investigate three UDA training schemes based on source data utilization: single-source, combined-source, and multi-source UDA. Our findings demonstrate the effectiveness of applying UDA on multiple sources for binary and multi-class classification. A strong correlation between test error and label shift in multi-class tasks has been observed in the experiment. Crucially, our study shows that UDA can effectively mitigate bias against minority groups and enhance fairness in diagnostic systems, while maintaining superior classification performance. This is achieved even without directly implementing fairness-focused techniques. This success is potentially attributed to the increased and well-adapted demographic information obtained from multiple sources.
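
The three schemes differ only in how the source datasets are grouped before adaptation. The sketch below illustrates that grouping under stated assumptions: `build_source_groups`, the `Sample` type, and the toy dataset names are hypothetical and are not taken from the paper's code. It shows data organization only, not the adaptation or training step.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder for a labeled image: (image id, class label).
Sample = Tuple[str, int]


def build_source_groups(
    sources: Dict[str, List[Sample]], scheme: str
) -> List[Dict[str, List[Sample]]]:
    """Return a list of training runs; each run maps a source-domain name to its samples."""
    if scheme == "single":
        # One independent run per source dataset.
        return [{name: samples} for name, samples in sources.items()]
    if scheme == "combined":
        # Pool every dataset into one source domain, then train as in single-source.
        pooled = [s for samples in sources.values() for s in samples]
        return [{"combined": pooled}]
    if scheme == "multi":
        # One run that keeps each dataset as its own domain,
        # so each can be aligned with the target separately.
        return [dict(sources)]
    raise ValueError(f"unknown scheme: {scheme}")


# Toy data for illustration only.
toy_sources = {
    "dataset_a": [("a1", 0), ("a2", 1)],
    "dataset_b": [("b1", 1)],
}
for scheme in ("single", "combined", "multi"):
    runs = build_source_groups(toy_sources, scheme)
    print(scheme, [{name: len(samples) for name, samples in run.items()} for run in runs])
```

In this sketch, single-source yields one run per dataset, combined-source yields one run with a single pooled domain, and multi-source yields one run that still distinguishes the domains.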

Paper Structure (19 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Illustration of single-, combined-, and multi-source UDA training for skin lesion classification. The shaded area demonstrates how single- and combined-source UDA operate. Specifically, combined-source UDA involves an additional step of aggregating multiple datasets into a single source, after which it follows the same training procedure as single-source UDA. The entire figure illustrates the multi-source training scheme, where all source domains (marked in orange and blue) are aligned with the target domain (marked in green) via a domain alignment component (marked in grey). A classifier is trained on each source domain independently, while also aligning with those trained on other source domains. All classifiers will be used to make target predictions during inference.
  • Figure 2: Feature distance (left), label distance (middle), and single-source test error for multi-class classification on the target test set (right). Pearson correlation coefficient is 0.31 between feature distance and test error and 0.78 between label distance and test error.
  • Figure 3: t-SNE visualizations for binary classification. The target domain is fitz-roi and the source domains are the other five datasets. Diagrams 1-2: combined-source without DA. Diagrams 3-4: combined-source DANN. Diagrams are colored by domain (1, 3) and by label (2, 4). After DANN, the domains are well aligned and the classes remain separable.
  • Figure 4: Fairness results on the Fitzpatrick17k, ISIC2020, and PAD-UFES-20 datasets for binary classification. Skin type is the sensitive attribute for Fitzpatrick17k, and age is the sensitive attribute for ISIC2020 and PAD-UFES-20. All UDA-based methods in this experiment use a weighted random sampler. The corresponding results table is in the appendix.
  • Figure 5: Image examples of each dataset considered in this study.
  • ...and 2 more figures
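
The Figure 2 caption relates a per-source label distance to single-source test error through a Pearson correlation coefficient. The sketch below shows how such a relationship could be computed. It is a sketch under assumptions: the paper's exact label-distance metric is not given here, so total variation distance between class distributions stands in for it, and all function names and numbers are illustrative rather than results from the study.

```python
import math
from collections import Counter
from typing import List, Sequence


def label_distribution(labels: Sequence[int], num_classes: int) -> List[float]:
    """Empirical class frequencies for integer labels in [0, num_classes)."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(num_classes)]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions (assumed metric)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Made-up labels and test errors, for illustration only.
target_labels = [0, 0, 1, 2, 2, 2]
source_labels = {
    "source_a": [0, 0, 1, 2, 2, 2],
    "source_b": [0, 1, 1, 1, 2, 2],
    "source_c": [0, 0, 0, 0, 0, 1],
}
test_error = {"source_a": 0.20, "source_b": 0.30, "source_c": 0.55}

target_dist = label_distribution(target_labels, num_classes=3)
names = sorted(source_labels)
distances = [
    total_variation(label_distribution(source_labels[n], 3), target_dist) for n in names
]
errors = [test_error[n] for n in names]
print("label distances:", dict(zip(names, distances)))
print("Pearson r:", pearson(distances, errors))
```

A coefficient near 1 on real data would indicate that sources whose class balance differs more from the target tend to give higher target test error, which is the pattern the caption describes for the multi-class task.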