INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models

Di Jin; Xing Liu; Yu Liu; Jia Qing Yap; Andrea Wong; Adriana Crespo; Qi Lin; Zhiyuan Yin; Qiang Yan; Ryan Ye

TL;DR

INFELM is an in-depth fairness evaluation framework for large text-to-image (T2I) models. It combines a novel skintone classifier, which fuses facial topology with skin pixel features, with bias-sensitive content alignment metrics and a representation-bias score. It assesses the outputs of nine industrial T2I models across six social domains, and finds that representation bias is generally more pronounced than alignment errors and that many models fail the four-fifths fairness rule. DALL-E 3 shows the strongest gender fairness, while Openjourney performs best on skintone alignment; base Stable Diffusion models tend to be fairer than their fine-tuned RealisticVision counterparts. The work provides a scalable benchmark for ethical multi-modal AI development and points to directions for improving demographic fairness in generated imagery.

Abstract

The rapid development of large language models (LLMs) and large vision models (LVMs) has propelled the evolution of multi-modal AI systems, which have demonstrated remarkable potential for industrial applications by emulating human-like cognition. However, they also pose significant ethical challenges, including amplifying harmful content and reinforcing societal biases. For instance, biases in some industrial image generation models have highlighted the urgent need for robust fairness assessments. Most existing evaluation frameworks focus on covering many aspects of the models comprehensively, but they exhibit critical limitations, including insufficient attention to content generation alignment and to social-bias-sensitive domains. More importantly, their reliance on pixel-detection techniques is prone to inaccuracies. To address these issues, this paper presents INFELM, an in-depth fairness evaluation of widely used text-to-image models. Our key contributions are: (1) an advanced skintone classifier incorporating facial topology and refined skin pixel representation to enhance classification precision by at least 16.04%, (2) a bias-sensitive content alignment measurement for understanding societal impacts, (3) a generalizable representation bias evaluation for diverse demographic groups, and (4) extensive experiments analyzing large-scale text-to-image model outputs across six social-bias-sensitive domains. We find that the existing models in the study generally do not meet the empirical fairness criteria, and that representation bias is generally more pronounced than alignment errors. INFELM establishes a robust benchmark for fairness assessment, supporting the development of multi-modal AI systems that align with ethical and human-centric principles.
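As an illustration of the kind of empirical fairness criterion referred to above, the following is a minimal sketch of a four-fifths check applied to predicted demographic labels. It assumes the conventional form of the rule (the least-represented group's share must be at least 80% of the most-represented group's share); the exact metric definitions used by INFELM are not given in this excerpt, and the function name `four_fifths_check`, its parameters, and the sample labels are hypothetical.

```python
from collections import Counter


def four_fifths_check(labels, groups, threshold=0.8):
    """Return (ratio, passes) for a list of predicted group labels.

    ratio is the smallest group share divided by the largest group share.
    `groups` lists every group that should be represented, so a group
    with zero generated images still counts (and gives ratio 0.0).
    This is an assumed, conventional form of the four-fifths rule.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    shares = [counts.get(g, 0) / len(labels) for g in groups]
    ratio = min(shares) / max(shares)
    return ratio, ratio >= threshold


# Hypothetical classifier outputs for 100 generated images.
skewed = ["female"] * 30 + ["male"] * 70
balanced = ["female"] * 45 + ["male"] * 55

skewed_ratio, skewed_ok = four_fifths_check(skewed, ["female", "male"])
balanced_ratio, balanced_ok = four_fifths_check(balanced, ["female", "male"])
```

In this sketch the skewed sample fails the check and the balanced one passes; a real evaluation would take the labels from the demographic classifiers rather than from hand-written lists.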

Paper Structure

This paper contains 20 sections, 6 equations, 9 figures, and 7 tables.

Introduction
Related Work
Preliminaries
Method
Gender Classification
Skintone Classification
Synthetic facial image generation and topology classifier
Dominant skin pixel extraction
Skintone classification
Fairness testing framework
Testing prompts
Fairness metrics
Experiments & Results
Skintone classification
Text-to-image model fairness analysis
...and 5 more sections

Figures (9)

Figure 1: Skintone confusion. Due to the unevenly distributed color spectrum and lighting disturbance, color-extraction and shortest-distance-based methods incorrectly group facial images into similar groups, which impedes the correctness of downstream analysis. In this example, most images are incorrectly labeled with Monk scale 6 due to its dominance in the Euclidean-represented RGB space of skintones.
Figure 2: INFELM overview. Given a bias-sensitive domain with fairness risks (e.g. wealth, education), INFELM first feeds prompts that follow the groundtruth distribution to a text-to-image model to generate images at scale. Then, the demographic classifiers automatically generate the corresponding labels, which are used for fairness analysis. Finally, INFELM outputs the comprehensive fairness analysis and the bias score.
Figure 3: Highlight of skintone pixel representations: mean and dominant pixel distributions before Otsu's method. Aggregating pixels into a single mean makes skintones less distinguishable, while the ordinal distributions (ordered by pixel weights) preserve more detailed information.
Figure 4: Skintone classification model architecture based on multi-modality feature fusion
Figure 5: Fairness analysis of Stable Diffusion v1.4
...and 4 more figures
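The shortest-distance baseline criticized in the Figure 1 caption can be sketched as follows. This is only an illustration of the general technique: the three-tone palette uses placeholder RGB values and is not the Monk scale, and `nearest_tone` and `apply_gain` are hypothetical helper names. It shows how a uniform lighting change alone can move a pixel to a different nearest tone.

```python
import math

# Hypothetical three-tone palette with placeholder RGB values.
# These are NOT the Monk scale colors.
PALETTE = {
    "light": (230, 200, 180),
    "medium": (170, 120, 90),
    "dark": (90, 60, 45),
}


def nearest_tone(rgb, palette=PALETTE):
    """Label a pixel with the palette entry closest in Euclidean RGB distance."""
    return min(palette, key=lambda name: math.dist(rgb, palette[name]))


def apply_gain(rgb, gain):
    """Simulate a uniform lighting change by scaling every channel."""
    return tuple(min(255, int(c * gain)) for c in rgb)


pixel = (210, 170, 145)
label_original = nearest_tone(pixel)
label_dimmed = nearest_tone(apply_gain(pixel, 0.75))
```

Here the same underlying pixel is labeled differently once it is dimmed, which is the kind of lighting disturbance the caption describes.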
