TrustSkin: A Fairness Pipeline for Trustworthy Facial Affect Analysis Across Skin Tone

Ana M. Cabanas, Alma Pedro, Domingo Mery

TL;DR

The paper investigates how skin-tone measurement choices affect fairness in Facial Affect Analysis (FAA). It compares the traditional Individual Typology Angle (ITA) with a perceptually grounded $H^*$-$L^*$ approach, which includes a brown-tone override, using AffectNet and a MobileNet classifier to assess performance across skin-tone groups. Disparities across groups reach up to $0.080$ in F1-score and up to $0.106$ in TPR. The $H^*$-$L^*$ method yields more consistent subgrouping and clearer Equal Opportunity diagnostics, though the analysis remains limited by the underrepresentation of Dark skin tones. Grad-CAM analyses reveal that model attention patterns differ by skin tone, suggesting varying feature encoding and the need for robust explainability. The authors propose a modular fairness-aware FAA pipeline that integrates perceptual skin-tone estimation, interpretability tools, and fairness evaluation to guide future mitigation strategies.

Abstract

Understanding how facial affect analysis (FAA) systems perform across different demographic groups requires reliable measurement of sensitive attributes such as ancestry, often approximated by skin tone, which itself is highly influenced by lighting conditions. This study compares two objective skin tone classification methods: the widely used Individual Typology Angle (ITA) and a perceptually grounded alternative based on Lightness ($L^*$) and Hue ($H^*$). Using AffectNet and a MobileNet-based model, we assess fairness across skin tone groups defined by each method. Results reveal a severe underrepresentation of dark skin tones ($\sim 2 \%$), alongside fairness disparities in F1-score (up to 0.08) and TPR (up to 0.11) across groups. While ITA shows limitations due to its sensitivity to lighting, the $H^*$-$L^*$ method yields more consistent subgrouping and enables clearer diagnostics through metrics such as Equal Opportunity. Grad-CAM analysis further highlights differences in model attention patterns by skin tone, suggesting variation in feature encoding. To support future mitigation efforts, we also propose a modular fairness-aware pipeline that integrates perceptual skin tone estimation, model interpretability, and fairness evaluation. These findings emphasize the relevance of skin tone measurement choices in fairness assessment and suggest that ITA-based evaluations may overlook disparities affecting darker-skinned individuals.
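The TPR disparity behind the Equal Opportunity diagnostic can be illustrated with a minimal sketch. It assumes that "disparity" means the largest gap in per-group true positive rate for a given emotion class; the exact definition used in the paper is not stated in this section. The function names and the toy labels are hypothetical and are not drawn from AffectNet.

```python
from collections import defaultdict


def tpr_by_group(y_true, y_pred, groups, positive):
    """True positive rate for one class, computed separately per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == positive:
            totals[group] += 1
            if pred == positive:
                hits[group] += 1
    # Groups with no positive samples are left out rather than assigned 0.
    return {g: hits[g] / totals[g] for g in totals}


def disparity(per_group):
    """Assumed definition: largest gap between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy labels for illustration only (not taken from AffectNet).
y_true = ["happy", "happy", "happy", "happy", "sad", "happy", "happy"]
y_pred = ["happy", "happy", "sad", "happy", "sad", "sad", "happy"]
groups = ["Light", "Light", "Medium", "Medium", "Medium", "Dark", "Dark"]

rates = tpr_by_group(y_true, y_pred, groups, positive="happy")
gap = disparity(rates)
print(rates, gap)
```

With very few samples in a group, as with the roughly 2% share of dark skin tones reported above, a per-group rate of this kind rests on a small denominator and should be read with that in mind.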

Paper Structure

This paper contains 13 sections, 2 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Distribution of ITA values in the cleaned AffectNet training set. Background shading reflects Fitzpatrick types; dashed lines mark our custom thresholds for Light ($\mathrm{ITA} > 55^\circ$), Medium ($30^\circ \leq \mathrm{ITA} \leq 55^\circ$), and Dark ($\mathrm{ITA} < 30^\circ$).
  • Figure 2: Skin tone distribution in the ($H^*$-$L^*$) color space. Point colors correspond to their respective RGB values. Dashed horizontal lines denote $L^*$ thresholds for skin tone classification: Light ($L^* > 67.0$), Medium ($37.0 \leq L^* \leq 67.0$), and Dark ($L^* < 37.0$).
  • Figure 3: Skin tone classification using the Individual Typology Angle (ITA). Images were randomly sampled from the training subset of the AffectNet dataset. While commonly used, ITA is sensitive to illumination and often misclassifies Medium and Dark skin tones.
  • Figure 4: Proposed classification using $L^*$ and Hue ($H^* = \arctan(b^*/a^*)$), capturing both lightness and chromaticity. Images were randomly sampled from the training subset of AffectNet. This method improves tone estimation, particularly under variable lighting and darker skin tones.
  • Figure 5: Distribution of labeled emotions in the AffectNet test set by skin tone groups classified using the $H^*$-$L^*$ method.
  • ...and 4 more figures
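
The thresholds quoted in the captions of Figures 1, 2, and 4 can be turned into a small sketch of both grouping rules. It rests on several assumptions: the input values are already in CIELAB; ITA uses the standard definition $\mathrm{ITA} = \arctan\big((L^* - 50)/b^*\big) \cdot 180/\pi$, which this section does not spell out; `atan2` replaces the plain arctangent to avoid division by zero (the two agree when the denominator is positive); and the function names are hypothetical. Only the $L^*$ cut-offs are given here, so the way hue enters the decision, including the brown-tone override, is left out.

```python
import math


def ita_degrees(L, b):
    """ITA in degrees, assuming the standard definition arctan((L* - 50) / b*)."""
    return math.degrees(math.atan2(L - 50.0, b))


def classify_ita(ita):
    """Thresholds from the Figure 1 caption."""
    if ita > 55.0:
        return "Light"
    if ita >= 30.0:
        return "Medium"
    return "Dark"


def hue_degrees(a, b):
    """H* = arctan(b* / a*) from the Figure 4 caption, via atan2."""
    return math.degrees(math.atan2(b, a))


def classify_lightness(L):
    """L* thresholds from the Figure 2 caption; hue-based override omitted."""
    if L > 67.0:
        return "Light"
    if L >= 37.0:
        return "Medium"
    return "Dark"


# Hypothetical CIELAB sample chosen for illustration.
L_star, a_star, b_star = 70.0, 10.0, 15.0
ita = ita_degrees(L_star, b_star)
print(round(ita, 2), classify_ita(ita), classify_lightness(L_star))
print(round(hue_degrees(a_star, b_star), 2))
```

For this sample the two rules disagree: the ITA rule assigns Medium while the $L^*$ rule assigns Light, which shows how the choice of measurement can move an image between groups.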