Can the accuracy bias by facial hairstyle be reduced through balancing the training data?

Kagan Ozturk; Haiyu Wu; Kevin W. Bowyer


TL;DR

This work investigates whether accuracy biases induced by facial hairstyles can be reduced by increasing training data size or by balancing hair distributions in training sets. Through controlled experiments using AdaFace on multiple WebFace-derived scales and a MORPH-based evaluation, the authors show that while overall recognition improves with more data, the accuracy gap between clean-shaven and facial-hair image pairs persists, and balancing training data does not eliminate this gap. They further test data augmentation that alters beard and mustache regions, observing some gains but no fundamental elimination of cross-hair bias, with effects differing across races. The findings imply that hairstyle-related fairness issues are not solved by data quantity or simple balancing, underscoring the importance of rigorous bias evaluation and more robust mitigation strategies in face recognition systems.

Abstract

The appearance of a face can be greatly altered by growing a beard and mustache. The facial hairstyles in a pair of images can cause marked changes to the impostor distribution and the genuine distribution. Also, different distributions of facial hairstyle across demographics could cause a false impression of relative accuracy across demographics. We first show that, even though larger training sets boost the recognition accuracy on all facial hairstyles, accuracy variations caused by facial hairstyles persist regardless of the size of the training set. Then, we analyze the impact of having different fractions of the training data represent facial hairstyles. We created balanced training sets using a set of identities available in WebFace42M that have both clean-shaven and facial hair images. We find that, even when a face recognition model is trained with a balanced clean-shaven / facial hair training set, accuracy variation on the test data does not diminish. Next, data augmentation is employed to further investigate the effect of facial hair distribution in training data by manipulating facial hair pixels with the help of facial landmark points and a facial hair segmentation model. Our results show that facial hair causes an accuracy gap between clean-shaven and facial hair images, and that this impact can be significantly different between African-Americans and Caucasians.
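
As a rough illustration of the balancing step described in the abstract, the sketch below keeps only identities that have both clean-shaven and facial hair images and samples an equal number of each per identity. The record layout `(identity, image_path, has_facial_hair)` and the function name `balanced_subset` are assumptions made for this example; the authors' actual selection procedure on WebFace42M is not specified here.

```python
import random
from collections import defaultdict


def balanced_subset(records, seed=0):
    """Return a clean-shaven / facial-hair balanced subset.

    records: iterable of (identity, image_path, has_facial_hair) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)

    # Group image paths per identity, split by facial hair label.
    by_identity = defaultdict(lambda: {True: [], False: []})
    for identity, path, has_fh in records:
        by_identity[identity][bool(has_fh)].append(path)

    subset = []
    for identity in sorted(by_identity):
        groups = by_identity[identity]
        # Use as many images per class as the smaller class offers.
        n = min(len(groups[True]), len(groups[False]))
        if n == 0:
            continue  # identity lacks one of the two hairstyles
        for has_fh in (False, True):
            for path in rng.sample(groups[has_fh], n):
                subset.append((identity, path, has_fh))
    return subset
```

Sampling the same count from both classes within each identity gives a 50/50 split overall and within every subject; other ratios would need a different per-class count.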

Paper Structure (8 sections, 2 equations, 6 figures, 2 tables, 1 algorithm)


Introduction
Related Work
Does recognition bias due to facial hairstyle decrease with larger training data?
How does the facial hair distribution in training data affect the performance?
Impact of facial hair variation across subjects
Impact of facial hair variation within subjects
Facial hair area and Casia-WebFace
Conclusion and Discussion

Figures (6)

Figure 1: The effect of training size on recognition is given for African-American (AAM) and Caucasian (CM) males on MORPH using 4 training sets: a facial hair-balanced subset of WebFace42M (120K images; see "Impact of facial hair variation within subjects"), Casia-WebFace (500K images), WebFace4M (4M images) and WebFace12M (12M images). Higher d-prime between genuine and impostor distribution means better recognition accuracy. While d-prime values for CS-CS (clean-shaven v. clean-shaven), CS-FH (clean-shaven v. facial hair) and FH-FH (facial hair v. facial hair) image pairs consistently increase as training data gets larger, the d-prime gap across facial hairstyles also increases. The AdaFace loss (Kim et al., 2022) is used to train the models.
Figure 2: Impostor and genuine distributions for CS-CS, CS-FH, and FH-FH image pairs. Similarity scores are obtained using an AdaFace model pretrained on WebFace12M. D-prime values are given at the upper-left of the plots. While the recognition performance is better for Caucasian males on CS-CS pairs (b), d-prime values are greater for African-American males on CS-FH and FH-FH pairs (a). Examples of AAM FH-FH (c) and CM CS-FH (d) genuine pairs are shown.
Figure 3: Mean faces of clean-shaven (CS) and facial hair (FH) image sets for African-American (AAM) and Caucasian (CM) males on MORPH.
Figure 4: Percentage of facial hair pixels in an FH image for Caucasian and African-American males on MORPH.
Figure 5: Effect of facial hair distribution in training set. Recognition performance is measured on African-American (a) and Caucasian (b) males on MORPH. D-prime values are shown for CS-CS (clean-shaven versus clean-shaven), CS-FH (clean-shaven versus facial hair) and FH-FH (facial hair versus facial hair) image pairs. Dashed lines show the effect of facial hair ratio variation within subjects (see "Impact of facial hair variation within subjects") and solid lines show the variation across subjects (see "Impact of facial hair variation across subjects"). Vertical bars show the standard deviation of 5 repetitions. Lower d-prime values are observed in most cases as the facial hair percentage exceeds $50\%$ in training data.
...and 1 more figure
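
The captions above report d-prime between the genuine and impostor score distributions. A minimal sketch of that statistic follows, assuming the commonly used definition (absolute difference of the means divided by the square root of the average of the two variances); the paper's exact formula is not reproduced in this summary, and the function name `d_prime` is chosen only for this example.

```python
import math
from statistics import mean, pvariance


def d_prime(genuine, impostor):
    """Separation between genuine and impostor similarity scores.

    Assumes d' = |mu_g - mu_i| / sqrt((var_g + var_i) / 2),
    with population variances; both inputs are sequences of floats.
    """
    mu_g, mu_i = mean(genuine), mean(impostor)
    pooled_std = math.sqrt(0.5 * (pvariance(genuine) + pvariance(impostor)))
    return abs(mu_g - mu_i) / pooled_std


# Toy similarity scores, not data from the paper.
genuine_scores = [0.8, 0.9, 1.0]
impostor_scores = [0.0, 0.1, 0.2]
print(round(d_prime(genuine_scores, impostor_scores), 3))
```

Computing this value separately for CS-CS, CS-FH, and FH-FH pairs and comparing the three numbers is one way to express the d-prime gap the captions refer to.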
