Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems

Siddharth D Jaiswal, Animesh Ganai, Abhisek Dash, Saptarshi Ghosh, Animesh Mukherjee

TL;DR

This paper tackles the problem of biased facial recognition systems in global deployments by introducing FARFace, a Global South–centric dataset with eight-country coverage and adversarial variants to stress-test gender-prediction across commercial and open-source FRSs. It combines a rigorous audit (including Grad-CAM explainability) with low-resource bias mitigation via few-shot fine-tuning and contrastive learning, achieving substantial reductions in gender disparities, especially for Global South females, while highlighting robustness to adversarial inputs. A red-teaming country-prediction experiment exposes ethical concerns around inferring ethnicity or nationality from faces, underscoring the need for cautious deployment and stronger governance. The work demonstrates that simple, data-efficient techniques can meaningfully mitigate biases and improve cross-regional generalization, while also providing a framework for continual temporal audits and responsible AI in facial recognition applications.

Abstract

Facial Recognition Systems (FRSs) are being developed and deployed globally at unprecedented rates. Most platforms are designed in a limited set of countries but deployed worldwide, without adequate checkpoints. This is especially problematic for Global South countries, which lack strong legislation to safeguard persons facing disparate performance of these systems. A combination of unavailable datasets, a lack of understanding of FRS functionality, and a lack of low-resource bias mitigation measures accentuates the problem. In this work, we propose a new face dataset composed of 6,579 unique male and female sportspersons from eight countries around the world. More than 50% of the dataset comprises individuals from Global South countries, and the dataset is demographically diverse. To aid adversarial audits and robust model training, each image has four adversarial variants, totaling over 40,000 images. We also benchmark five popular FRSs, both commercial and open-source, on the task of gender prediction (and on country prediction for one of the open-source models, as an example of red-teaming). Experiments on industrial FRSs reveal accuracies ranging from 98.2% down to 38.1%, with a large disparity between males and females in the Global South (maximum difference of 38.5%). Biases are also observed in all FRSs between females of the Global North and South (maximum difference of ~50%). Grad-CAM analysis identifies the nose, forehead and mouth as the regions of interest for one of the open-source FRSs. Using this insight, we design simple, low-resource bias mitigation solutions based on few-shot and novel contrastive learning techniques. These significantly improve accuracy, with the disparity between males and females falling from 50% to 1.5% in one of the settings. In the red-teaming experiment with the open-source Deepface model, contrastive learning proves more effective than simple fine-tuning.
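The disparities quoted in the abstract are differences in gender-prediction accuracy between demographic groups. The snippet below is a minimal sketch of how such a gap could be computed, not the authors' evaluation code. The record layout, the function names `group_accuracy` and `disparity`, and the toy predictions are all assumptions made for illustration.

```python
from collections import defaultdict


def group_accuracy(records):
    """Return accuracy (in %) per (region, gender) group.

    `records` is an iterable of (region, gender, true_label, predicted_label)
    tuples; this layout is a hypothetical one chosen for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, gender, truth, pred in records:
        key = (region, gender)
        total[key] += 1
        correct[key] += int(truth == pred)
    return {key: 100.0 * correct[key] / total[key] for key in total}


def disparity(acc, group_a, group_b):
    """Absolute accuracy gap, in percentage points, between two groups."""
    return abs(acc[group_a] - acc[group_b])


# Toy, made-up predictions (not taken from the paper).
records = [
    ("south", "male", "male", "male"),
    ("south", "male", "male", "male"),
    ("south", "female", "female", "male"),    # misclassified
    ("south", "female", "female", "female"),
    ("north", "female", "female", "female"),
    ("north", "female", "female", "female"),
]

acc = group_accuracy(records)
# Gap between males and females within the Global South.
gender_gap_south = disparity(acc, ("south", "male"), ("south", "female"))
# Gap between females of the Global North and the Global South.
female_gap = disparity(acc, ("north", "female"), ("south", "female"))
```

The same two functions can be applied separately to each image type (the original images and each adversarial variant) to see where a gap widens.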

Paper Structure (17 sections, 5 figures, 17 tables)

Figures (5)

  • Figure 1: Images from our FARFace dataset. The first row has images from the Global North: Australia, New Zealand, England and South Africa. The second row has images from the Global South: India, Bangladesh, Pakistan and West Indies. The third row shows the average face for each region (Global North male, Global North female, Global South male and Global South female), generated by superimposing the images of individuals from each region.
  • Figure 2: Adversarial variants in the FARFace dataset, shown for an example image (original image in (a)).
  • Figure 3: Overall accuracy for all FRSs, segregated by image type, for all images (a) and for each gender group (b, c). On average, AWS and Deepface are the best-performing commercial and open-source FRSs respectively, independent of gender. All the FRSs are least robust to RGB$_{0.5}$ for both genders and to MASK for females. The FRSs are AWS Rekognition (AWS), Microsoft Azure Face (MSFT), Face++ (FPP), Libfaceid (LIBFC) and Deepface (DPFC).
  • Figure 4: Example activation maps from the Grad-CAM analysis of the ORIG set for Deepface. The images are ordered as follows. Row 1: males correctly predicted as male (New Zealand, Pakistan, Bangladesh). Row 2: females incorrectly predicted as male (New Zealand, West Indies, South Africa). Row 3: females correctly predicted as female (England, West Indies, Australia). Row 4: males incorrectly predicted as female (England, India, India). For the images classified as male (first two rows), there is a more systematic region of interest, whereas the region of interest seems random for images classified as female (last two rows). The last column in each row shows the average Grad-CAM activation map, indicating the generalizability of our analysis.
  • Figure 5: Example activation maps from the Grad-CAM analysis of the ORIG set (held-out test set) for Deepface after two-shot fine-tuning on the ORIG set. The images are ordered as follows. Row 1: males correctly predicted as male (Australia, New Zealand, Pakistan, Bangladesh). Row 2: females incorrectly predicted as male (West Indies, Bangladesh, Pakistan, India). Row 3: females correctly predicted as female (England, West Indies, South Africa, India). Row 4: males incorrectly predicted as female (Australia, West Indies, India, England). Row 5 has the average Grad-CAM activation maps for males correctly predicted as male, females correctly predicted as female, females incorrectly predicted as male, and males incorrectly predicted as female. After fine-tuning, there is a more systematic focus on the nose when females are correctly predicted.