Exploring Disparity-Accuracy Trade-offs in Face Recognition Systems: The Role of Datasets, Architectures, and Loss Functions

Siddharth D Jaiswal, Sagnik Basu, Sandipan Sikdar, Animesh Mukherjee

TL;DR

The paper investigates how three core components—model architecture, loss function, and face image datasets—jointly shape the accuracy-disparity trade-off in face recognition systems performing gender prediction. It conducts a large-scale audit across three FRS backbones (LibfaceID, ViT-Face, InstructBLIP), seven diverse datasets, and four loss functions, totaling 266 configurations. Key findings show that dataset choice strongly influences perceived bias and that larger, more complex models can reduce disparity, though the direction of bias can flip across datasets; the interaction with embeddings and loss functions is highly architecture-dependent. The results offer practical recommendations for developers and platform operators and highlight ethical considerations and the need for debiasing and localized deployment strategies in real-world settings.

Abstract

Automated Face Recognition Systems (FRSs), developed using deep learning models, are deployed worldwide for identity verification and facial attribute analysis. The performance of these models is determined by a complex interdependence among the model architecture, optimization/loss function and datasets. Although FRSs have surpassed human-level accuracy, they continue to show disparate performance against certain demographics. Due to the ubiquity of applications, it is extremely important to understand the impact of the three components (model architecture, loss function and face image dataset) on the accuracy-disparity trade-off in order to design better, unbiased platforms. In this work, we perform an in-depth analysis of three FRSs for the task of gender prediction. Various architectural modifications yield ten deep-learning models, which we couple with four loss functions and benchmark on seven face datasets across 266 evaluation configurations. Our results show that all three components have an individual as well as a combined impact on both accuracy and disparity. We identify that datasets have an inherent property that causes them to perform similarly across models, independent of the choice of loss functions. Moreover, the choice of dataset determines the model's perceived bias: the same model reports bias in opposite directions for three gender-balanced datasets of "in-the-wild" face images of popular individuals. Studying the facial embeddings shows that the models are unable to generalize a uniform definition of what constitutes a "female face" as opposed to a "male face", due to dataset diversity. We provide recommendations to model developers on using our study as a blueprint for model development and subsequent deployment.
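
To make the accuracy-disparity trade-off concrete, the following is a minimal sketch of how the two quantities could be computed for one gender-prediction run. It assumes that disparity is the difference between per-gender accuracies (male minus female), which matches how the figure captions describe the direction of disparity but is not confirmed here as the paper's exact definition. The function name `gender_accuracy_report` and the label strings `"male"` and `"female"` are hypothetical choices for illustration.

```python
from collections import defaultdict


def gender_accuracy_report(y_true, y_pred):
    """Summarize accuracy and gender disparity for one evaluation run.

    Assumption (not taken from the paper): disparity is defined as
    accuracy on male faces minus accuracy on female faces.
    """
    correct = defaultdict(int)  # correct predictions per true gender
    total = defaultdict(int)    # number of samples per true gender
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)

    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())

    # Positive: higher accuracy for males; negative: higher for females.
    signed = per_group["male"] - per_group["female"]
    return {
        "overall": overall,
        "per_group": per_group,
        "signed_disparity": signed,
        "abs_disparity": abs(signed),
    }


if __name__ == "__main__":
    # Toy, hand-made labels; these are not results from the paper.
    y_true = ["male"] * 4 + ["female"] * 4
    y_pred = ["male", "male", "male", "female",
              "female", "female", "male", "male"]
    print(gender_accuracy_report(y_true, y_pred))
```

Under this assumed definition, the sign of `signed_disparity` gives the direction of the bias and `abs_disparity` gives its magnitude, so a model can score well on overall accuracy while still showing a large gap between the two groups.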

Paper Structure

This paper contains 51 sections, 12 figures, 14 tables.

Figures (12)

  • Figure 1: Interdependence between the architecture, data and loss function that determine the performance of a model, which in turn informs the choice of components.
  • Figure 2: Schematics for the Libfaceid [levi2015age] model modified only with residual connections, and with extra layers and residual connections.
  • Figure 3: Accuracy vs. absolute disparity for different architectures across all datasets and loss functions. CFD always reports high accuracy, whereas FARFace and CelebA primarily report high disparities. On every architecture, each dataset has a similar performance for all loss functions that it is evaluated with. The shapes refer to the datasets, with colors referring to the various loss functions. Each combination of dataset and loss function refers to one experimental setup. The model complexity, choice of loss function and choice of dataset impact both the accuracy and the disparity.
  • Figure 4: Heatmaps indicating the extent and the direction of disparity between the two genders for different architectures across all datasets and loss functions. CelebSET is always disparate against males, and FARFace is always disparate against females, independent of all other factors, even though both are balanced datasets. Red indicates higher accuracy for males, and blue indicates higher accuracy for females. The intensity of the color signals the magnitude of the gender disparity.
  • Figure 5: Relative change in accuracy for males vs. females, for all losses when compared against only the CE loss, for all FRSs and datasets. The diagonal line indicates an equal relative change for both genders, and points on either side imply a larger relative change for the respective gender. Modifying the models affects the relative change in accuracy for both genders.
  • ...and 7 more figures