
Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild

Vitor Pereira Matias, Márcus Vinícius Lobo Costa, João Batista Neto, Tiago Novello de Brito

TL;DR

This work provides state-of-the-art results in skin tone classification and fairness assessment and proposes SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2.

Abstract

Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness; use small, private datasets that prevent reproducibility; or rely on classic computer vision pipelines, with only a few using deep learning. They also overlook issues such as train-test leakage and dataset imbalance. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce Skin Tone in the Wild (STW), a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone Monk Skin Tone (MST) scale. Second, we benchmark both Classic Computer Vision (SkinToneCCV) and Deep Learning approaches, demonstrating that classic models provide near-random results, while deep learning approaches annotator-level accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data will be available soon.

Paper Structure (11 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Teaser. Mapping demographic diversity to the Monk Skin Tone (MST) scale via embedding projection. The 10-tone MST scale (middle) and its human examples (bottom) offer a continuous representation of human skin reflectance, overcoming the limitations of traditional 6-tone categorical scales. (Top) A 1D t-SNE projection of DINOv3 ViT-S embeddings from our Skin Tone in the Wild (STW) dataset reveals a clear, continuous structure learned by our network. The alignment between the clusters and the MST classes suggests that our model generalizes to in-the-wild (ITW) data.
  • Figure 2: Confusion matrices between annotators. Annotators 1 vs. 2 show a similar pattern. (Zoom for details.)
  • Figure 3: Dataset distribution over different categories for all datasets. Top left: images per class; top right: individuals per class; bottom: distribution of images per class by individual image multiplicity. (Zoom for details.)
  • Figure 4: Methodology for skin tone classification. Our pipeline integrates data from multiple open-access sources via an "innovative annotation interface". To ensure robust evaluation, we apply class balancing to the training set and employ two distinct partitioning strategies: Split by Images (IMG) and Split by Individuals (IND). These sets are processed through both handcrafted (CCVm) and deep learning (DL) pipelines, ensuring that our method is reliable and resistant to data leakage.
  • Figure 5: Confusion matrix analysis of the CCVm pipeline. Despite balancing, the model collapses towards frequent labels.
  • ...and 3 more figures
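
The two partitioning strategies named in the Figure 4 caption, Split by Images (IMG) and Split by Individuals (IND), can be illustrated with a minimal sketch. This is not the authors' code: the record fields (`image`, `person_id`, `mst`), the function names, and the test fraction are assumptions made for illustration only. The point it shows is that an IND split keeps every image of a person on one side of the partition, whereas an IMG split can place the same person in both train and test.

```python
import random
from collections import defaultdict


def split_by_images(samples, test_fraction=0.2, seed=0):
    """IMG split (hypothetical helper): shuffle images, ignoring identity."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def split_by_individuals(samples, test_fraction=0.2, seed=0):
    """IND split (hypothetical helper): all images of a person stay together."""
    by_person = defaultdict(list)
    for sample in samples:
        by_person[sample["person_id"]].append(sample)

    person_ids = sorted(by_person)  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(person_ids)

    n_test = max(1, int(round(len(person_ids) * test_fraction)))
    test = [s for pid in person_ids[:n_test] for s in by_person[pid]]
    train = [s for pid in person_ids[n_test:] for s in by_person[pid]]
    return train, test


def shared_individuals(train, test):
    """Return the identities that appear on both sides of a split."""
    return {s["person_id"] for s in train} & {s["person_id"] for s in test}


# Toy data (assumed layout): 20 people, 5 images each, MST labels in 1..10.
samples = [
    {"image": f"p{p}_img{i}.jpg", "person_id": p, "mst": (p % 10) + 1}
    for p in range(20)
    for i in range(5)
]

train_ind, test_ind = split_by_individuals(samples)
train_img, test_img = split_by_images(samples, test_fraction=0.22)

print("IND overlap:", len(shared_individuals(train_ind, test_ind)))
print("IMG overlap:", len(shared_individuals(train_img, test_img)))
```

In this toy setup the IND split never shares an identity between train and test. The IMG split with 22 test images must share at least one, because 22 is not a multiple of the 5 images per person. Class balancing of the training set, which the caption also mentions, is not covered by this sketch.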