Mitigating Bias with Words: Inducing Demographic Ambiguity in Face Recognition Templates by Text Encoding

Tahar Chettaoui; Naser Damer; Fadi Boutros

TL;DR

The paper tackles demographic bias in face recognition (FR) with Unified Text-Image Embedding (UTIE), a method that uses Vision-Language Models (VLMs) to enrich each face embedding with text-derived demographic features from groups other than the one predicted for the face, yielding a more demographically ambiguous representation. UTIE computes the mean of the text embeddings of the non-predicted demographic groups and adds it to the image embedding, which reduces the embedding's alignment with any single demographic class. Evaluations on the RFW and BFW benchmarks with CLIP, OpenCLIP, and SigLIP show consistent reductions in the bias metrics STD and SER while preserving, and in several cases improving, verification accuracy. The approach is a practical bias-mitigation strategy that relies on cross-modal semantic alignment rather than retraining the base FR system, and it opens avenues for prompt engineering and robust multi-demographic representations.
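
The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `l2_normalize` and `utie_embedding` are hypothetical, and the sketch assumes that the predicted group is the zero-shot argmax of cosine similarity between the image embedding and one text embedding per demographic prompt, that all embeddings are L2-normalized, that the text mean is added with unit weight, and that the result is renormalized. The embeddings themselves would come from a VLM such as CLIP; here they are plain arrays.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def utie_embedding(image_emb, text_embs):
    """Hypothetical UTIE-style fusion.

    image_emb: array of shape (d,), one face image embedding.
    text_embs: array of shape (k, d), one text embedding per demographic prompt.
    Returns the fused unit-norm embedding and the index of the predicted group.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_embs, dtype=float))

    # Assumption: zero-shot prediction = prompt with highest cosine similarity.
    sims = txt @ img
    predicted = int(np.argmax(sims))

    # Mean of the text embeddings of all non-predicted groups.
    others = np.delete(txt, predicted, axis=0)
    other_mean = others.mean(axis=0)

    # Assumption: unweighted addition followed by renormalization.
    fused = l2_normalize(img + other_mean)
    return fused, predicted
```

With four demographic prompts, the fused embedding moves away from the predicted group's text embedding and toward the other three, which is the "demographic ambiguity" the summary refers to. Verification would then compare fused embeddings instead of the raw image embeddings.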

Abstract

Face recognition (FR) systems are often prone to demographic biases, partially due to the entanglement of demographic-specific information with identity-relevant features in facial embeddings. This bias is extremely critical in large multicultural cities, especially where biometrics play a major role in smart city infrastructure. The entanglement can cause demographic attributes to overshadow identity cues in the embedding space, resulting in disparities in verification performance across different demographic groups. To address this issue, we propose a novel strategy, Unified Text-Image Embedding (UTIE), which aims to induce demographic ambiguity in face embeddings by enriching them with information related to other demographic groups. This encourages face embeddings to emphasize identity-relevant features and thus promotes fairer verification performance across groups. UTIE leverages the zero-shot capabilities and cross-modal semantic alignment of Vision-Language Models (VLMs). Given that VLMs are naturally trained to align visual and textual representations, we enrich the facial embeddings of each demographic group with text-derived demographic features extracted from other demographic groups. This encourages a more neutral representation in terms of demographic attributes. We evaluate UTIE using three VLMs, CLIP, OpenCLIP, and SigLIP, on two widely used benchmarks, RFW and BFW, designed to assess bias in FR. Experimental results show that UTIE consistently reduces bias metrics while maintaining, or even improving in several cases, the face verification accuracy.
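
The bias metrics named in the TL;DR, STD and SER, are not defined on this page. The sketch below uses the definitions that are common in the FR fairness literature and assumes the paper follows them: STD is the standard deviation of per-group verification accuracy, and SER (Skewed Error Ratio) is the largest per-group error rate divided by the smallest. The function name `bias_metrics`, the use of the population standard deviation, and the example numbers are assumptions for illustration only.

```python
import numpy as np


def bias_metrics(acc_by_group):
    """Compute (STD, SER) from per-group verification accuracies in percent.

    acc_by_group: dict mapping a group name to its accuracy in [0, 100].
    STD: population standard deviation of the accuracies (assumed convention).
    SER: max error rate / min error rate across groups.
    """
    accs = np.array(list(acc_by_group.values()), dtype=float)
    errs = 100.0 - accs

    std = float(accs.std())

    # SER is undefined when the best group has zero error; report infinity.
    min_err = errs.min()
    ser = float("inf") if min_err == 0 else float(errs.max() / min_err)
    return std, ser


# Illustrative values only, not results from the paper.
std, ser = bias_metrics({"group_a": 90.0, "group_b": 80.0})
```

Lower values of both metrics indicate more uniform performance across groups; an SER of 1 means every group has the same error rate.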
