Mitigating Bias with Words: Inducing Demographic Ambiguity in Face Recognition Templates by Text Encoding

Tahar Chettaoui, Naser Damer, Fadi Boutros

TL;DR

The paper tackles demographic bias in face recognition by proposing Unified Text-Image Embedding (UTIE), a method that uses Vision-Language Models to enrich face embeddings with text-derived demographic features from the groups other than the one predicted for the face, yielding more demographically ambiguous representations. UTIE computes the mean of the non-predicted demographic text embeddings and adds it to the image embedding, which reduces alignment with any single demographic class. Evaluations on RFW and BFW with CLIP, OpenCLIP, and SigLIP show consistent reductions in the bias metrics (STD and SER) while preserving or improving verification accuracy. The approach is a practical bias-mitigation strategy that exploits cross-modal semantic alignment without retraining the base FR system, and it opens avenues for prompt engineering and more robust multi-demographic representations.

Abstract

Face recognition (FR) systems are often prone to demographic biases, partially due to the entanglement of demographic-specific information with identity-relevant features in facial embeddings. This bias is extremely critical in large multicultural cities, especially where biometrics play a major role in smart city infrastructure. The entanglement can cause demographic attributes to overshadow identity cues in the embedding space, resulting in disparities in verification performance across different demographic groups. To address this issue, we propose a novel strategy, Unified Text-Image Embedding (UTIE), which aims to induce demographic ambiguity in face embeddings by enriching them with information related to other demographic groups. This encourages face embeddings to emphasize identity-relevant features and thus promotes fairer verification performance across groups. UTIE leverages the zero-shot capabilities and cross-modal semantic alignment of Vision-Language Models (VLMs). Given that VLMs are naturally trained to align visual and textual representations, we enrich the facial embeddings of each demographic group with text-derived demographic features extracted from other demographic groups. This encourages a more neutral representation in terms of demographic attributes. We evaluate UTIE using three VLMs, CLIP, OpenCLIP, and SigLIP, on two widely used benchmarks, RFW and BFW, designed to assess bias in FR. Experimental results show that UTIE consistently reduces bias metrics while maintaining, or even improving in several cases, the face verification accuracy.
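The bias metrics referred to in the summary, STD and SER, can be illustrated with a small sketch. It assumes the definitions commonly used with RFW-style benchmarks, which this page does not spell out: STD as the standard deviation of per-group verification accuracies, and SER (skewed error ratio) as the largest per-group error rate divided by the smallest. The function name `bias_metrics`, the choice of population rather than sample standard deviation, and the accuracy values are all hypothetical and are not results from the paper.

```python
import statistics


def bias_metrics(accuracy_by_group):
    """Return (STD, SER) for a mapping of group name -> accuracy in percent.

    Assumed definitions (not confirmed by the source page):
      STD = population standard deviation of the per-group accuracies
      SER = max per-group error / min per-group error, with error = 100 - accuracy
    """
    accuracies = list(accuracy_by_group.values())
    std = statistics.pstdev(accuracies)

    errors = [100.0 - acc for acc in accuracies]
    if min(errors) == 0:
        raise ValueError("SER is undefined when a group has zero error")
    ser = max(errors) / min(errors)
    return std, ser


# Hypothetical accuracies, chosen only to exercise the function.
example = {"African": 90.0, "Asian": 92.0, "Caucasian": 96.0, "Indian": 94.0}
std, ser = bias_metrics(example)
print(f"STD = {std:.3f}, SER = {ser:.3f}")
```

Under these definitions both metrics shrink as the groups perform more alike: STD approaches 0 and SER approaches 1.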

Paper Structure

This paper contains 11 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparison of our UTIE average cosine similarities with racial text embeddings against the baseline (IE) on the RFW subsets, using CLIP ViT/B-16. For each subset, we compute the average cosine similarities of the face embeddings with the different text embeddings $T_{i}$, for $i \in \{1,\dots,4\}$, representing the four considered races: African (blue), Asian (green), Caucasian (yellow), and Indian (red). (a) IE displays a high similarity with the corresponding demographic class $T_{\hat{i}}$ for all subsets while maintaining lower similarities with the other classes $T_{i}$ for $i \in \{1,\dots,4\}, \; i \neq \hat{i}$. This shows that face embeddings encode demographic-specific information. (b) For UTIE, the similarity to $T_{\hat{i}}$ is significantly lower compared to IE, and the similarities are spread more evenly across the other classes. This indicates that UTIE makes the racial information in the embeddings more ambiguous, reducing the alignment with a single demographic and distributing similarity more evenly across classes.
  • Figure 2: Demographically ambiguous embedding generation with UTIE for racial bias mitigation. (a) We use a text encoder $f_t(\cdot)$ to extract racial embeddings $T_{i}$, for $i \in \{1,\dots,4\}$, for zero-shot prediction. We then exclude the predicted embedding $T_{\hat{i}}$ (in this case $T_{1}$), compute the average of the remaining demographic embeddings, denoted as $\bar{T}$, and add it to the original image embedding $I$, resulting in $I'$. (b) The yellow vector represents the baseline IE. The blue vector represents our UTIE. This vector is closer to the unbiased axes, as it is more demographically ambiguous, as demonstrated in the section on the unified text-image embedding. Finally, the red vector represents IE+PTE, defined in the experimental setup section. This vector shows stronger demographic dominance, in this case leaning more toward African identity, illustrating increased demographic influence.
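
The construction described in the Figure 2(a) caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `l2_normalize`, `cosine`, and `utie_embedding` are hypothetical, the L2 normalization of both inputs before the zero-shot prediction is an assumption, and the toy vectors stand in for real encoder outputs (in practice $I$ and $T_i$ would come from the image and text encoders of a VLM such as CLIP).

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length (assumed preprocessing, not stated in the source)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def utie_embedding(image_emb, text_embs):
    """Return (I', predicted_index) following the Figure 2(a) description.

    image_emb: shape (d,), the image embedding I
    text_embs: shape (k, d), one text embedding T_i per demographic class
    """
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)

    # Zero-shot prediction: the class whose text embedding is most similar to I.
    similarities = text_embs @ image_emb
    predicted = int(np.argmax(similarities))

    # Exclude the predicted class and average the remaining text embeddings (T-bar).
    keep = np.ones(len(text_embs), dtype=bool)
    keep[predicted] = False
    t_bar = text_embs[keep].mean(axis=0)

    # I' = I + T-bar
    return image_emb + t_bar, predicted


# Toy data: four orthogonal "class" directions in an 8-dimensional space.
T = np.eye(4, 8)
I = np.array([0.8, 0.1, 0.1, 0.1, 0.5, 0.2, 0.0, 0.0])

I_prime, i_hat = utie_embedding(I, T)
print("predicted class index:", i_hat)
print("similarity to predicted class before:", cosine(I, T[i_hat]))
print("similarity to predicted class after: ", cosine(I_prime, T[i_hat]))
```

On the toy data the similarity to the predicted class drops and the similarities to the other classes rise, which is the qualitative effect the Figure 1 caption reports for UTIE versus IE.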