Consensus and Subjectivity of Skin Tone Annotation for ML Fairness

Candice Schumann; Gbolahan O. Olanubi; Auriel Wright; Ellis Monk; Courtney Heldreth; Susanna Ricco

TL;DR

This paper examines the subjectivity of skin tone annotation for ML fairness by introducing the MST-E dataset and conducting two annotation studies: one with a small pool of professional photographers and one with a much larger pool of trained crowdsourced annotators. It shows that both groups can reliably annotate skin tone on the MST scale in a way that aligns with the scale creator's intent, even under diverse lighting and on in-the-wild images. However, regional cultural contexts shape annotations, which underscores the need for geographically diverse annotator pools and a higher replication count per image to achieve robust fairness measurements. The work provides practical guidelines for practitioners and releases MST-E, which spans the full MST scale, to help train annotators and support fairness analyses of computer vision systems.

Abstract

Understanding different human attributes and how they affect model behavior may become a standard need for all model creation and usage, from traditional computer vision tasks to the newest multimodal generative AI systems. In computer vision specifically, we have relied on datasets augmented with perceived attribute signals (e.g., gender presentation, skin tone, and age) and benchmarks enabled by these datasets. Typically, labels for these tasks come from human annotators. However, annotating attribute signals, especially skin tone, is a difficult and subjective task. Perceived skin tone is affected by technical factors, like lighting conditions, and social factors that shape an annotator's lived experience. This paper examines the subjectivity of skin tone annotation through a series of annotation experiments using the Monk Skin Tone (MST) scale, a small pool of professional photographers, and a much larger pool of trained crowdsourced annotators. Along with this study we release the Monk Skin Tone Examples (MST-E) dataset, containing 1515 images and 31 videos spread across the full MST scale. MST-E is designed to help train human annotators to annotate MST effectively. Our study shows that annotators can reliably annotate skin tone in a way that aligns with an expert in the MST scale, even under challenging environmental conditions. We also find evidence that annotators from different geographic regions rely on different mental models of MST categories, resulting in annotations that systematically vary across regions. Given this, we advise practitioners to use a diverse set of annotators and a higher replication count for each image when annotating skin tone for fairness research.
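
The advice to use a higher replication count can be illustrated with the Spearman-Brown prediction formula, which the caption of Figure 5 names as the basis for reliability calculations over replications. The sketch below is a minimal illustration, not the authors' implementation: the function name `spearman_brown` and the single-annotator reliability of 0.5 are assumptions chosen for the example and are not values reported in the paper.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Predicted reliability of the averaged rating from n_raters annotators.

    Standard Spearman-Brown prediction formula:
        r_k = k * r / (1 + (k - 1) * r)
    where r is the reliability of a single annotator and k is the
    replication count (number of annotators per image).
    """
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Hypothetical single-annotator reliability (an assumption, not from the paper).
r_single = 0.5

# Predicted reliability grows with the replication count, with diminishing returns.
for k in (1, 2, 5, 10):
    print(f"replications={k:2d}  predicted reliability={spearman_brown(r_single, k):.3f}")
```

Under this formula, each additional annotator per image raises the predicted reliability of the aggregated label, though the gain shrinks as the replication count grows.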

Paper Structure (46 sections, 11 figures)

This paper contains 46 sections, 11 figures.

Introduction
Background and Related Work
Subjective Annotations
Skin Tone Research
Annotations for ML Fairness in Computer Vision
The Monk Skin Tone (MST) Scale and the MST-E Dataset
Consent and terms of use
The Annotation Experiment
Exploration with Experts
Inter-Rater Reliability
Regional Differences in Annotation
Distance from Intended Annotations
Exploration of Crowdsourced Annotations
Inter-Rater Reliability
Regional Differences in Annotation
...and 31 more sections

Figures (11)

Figure 1: Monk Skin Tone (MST) scale (Monk, 2022).
Figure 2: The MST-E dataset contains 19 subjects that span the full MST scale. Each subject has images in a variety of lighting conditions and poses. Images by TONL. Copyright TONL.CO 2022 ALL RIGHTS RESERVED. Used with permission.
Figure 3: Inter-rater reliability (IRR) between annotators calculated via intraclass correlation (ICC). The experts panel shows the ICC scores for expert photographers in the US and India on the Challenging set and the Golden set. The crowdsourced panel shows the ICC scores for crowdsourced annotators in each region, as well as the global pool. These ICC scores are calculated per subject, as well as over the Challenging and Golden sets.
Figure 4: MST annotation counts for Subject 14 from photography experts in India (top plots) and the US (bottom plots). One panel shows Subject 14's Golden Image chosen by the MST scale expert (see the MST-E dataset section). The other panel shows Subject 14's matching Challenging Image. In large-scale experiments with trained annotators (see the crowdsourced annotation section), we found consistently measurable differences in annotator behavior based on region.
Figure 5: Spearman-Brown inter-rater reliability calculations over replications of annotators on the MST-E dataset per subject, over the Challenging Images, and over the Golden Images.
...and 6 more figures
