Uncovering Bias in Foundation Models: Impact, Testing, Harm, and Mitigation

Shuzhou Sun; Li Liu; Yongxiang Liu; Zhen Liu; Shuanghui Zhang; Janne Heikkilä; Xiang Li

TL;DR

This paper introduces Trident Probe Testing (TriProTesting), a systematic testing method that detects explicit and implicit biases using semantically designed probes, and proposes Adaptive Logit Adjustment (AdaLogAdjustment), a post-processing technique that dynamically redistributes probability power to mitigate these biases, achieving significant improvements in fairness without retraining models.

Abstract

Bias in Foundation Models (FMs), which are trained on vast datasets spanning societal and historical knowledge, poses significant challenges for fairness and equity across fields such as healthcare, education, and finance. These biases, rooted in the overrepresentation of stereotypes and societal inequalities in training data, exacerbate real-world discrimination, reinforce harmful stereotypes, and erode trust in AI systems. To address this, we introduce Trident Probe Testing (TriProTesting), a systematic testing method that detects explicit and implicit biases using semantically designed probes. Here we show that FMs, including CLIP, ALIGN, BridgeTower, and OWLv2, demonstrate pervasive biases across single and mixed social attributes (gender, race, age, and occupation). Notably, we uncover mixed biases when social attributes are combined, such as gender × race, gender × age, and gender × occupation, revealing deeper layers of discrimination. We further propose Adaptive Logit Adjustment (AdaLogAdjustment), a post-processing technique that dynamically redistributes probability power to mitigate these biases effectively, achieving significant improvements in fairness without retraining models. These findings highlight the urgent need for ethical AI practices and interdisciplinary solutions to address biases not only at the model level but also in societal structures. Our work provides a scalable and interpretable solution that advances fairness in AI systems while offering practical insights for future research on fair AI technologies.
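To make the probe idea concrete, here is a minimal sketch of a probe test on precomputed scores. It assumes that a probe (for example "criminal") is appended to the candidate label set of a zero-shot classifier, and that "probability of being predicted as a probe" can be read as the fraction of images whose top-1 label is the probe. The score matrix, the helper names `probe_rate` and `group_gap`, and the toy group labels are all hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def probe_rate(logits, probe_index):
    # Fraction of samples whose top-1 prediction is the probe label.
    preds = softmax(logits).argmax(axis=1)
    return float(np.mean(preds == probe_index))

def group_gap(logits, groups, probe_index, group_a, group_b):
    # Probe rate of group_a minus probe rate of group_b.
    # A positive value means group_a is more often predicted as the probe.
    rate_a = probe_rate(logits[groups == group_a], probe_index)
    rate_b = probe_rate(logits[groups == group_b], probe_index)
    return rate_a - rate_b

# Hypothetical image-text scores; columns are ["chef", "doctor", "criminal"],
# where "criminal" plays the role of a negative probe (column index 2).
logits = np.array([
    [2.0, 0.5, 0.1],  # woman, top-1 is "chef"
    [0.3, 0.2, 1.5],  # woman, top-1 is the probe
    [1.0, 2.0, 0.0],  # man, top-1 is "doctor"
    [2.5, 0.1, 0.4],  # man, top-1 is "chef"
])
groups = np.array(["woman", "woman", "man", "man"])

overall = probe_rate(logits, probe_index=2)
gap = group_gap(logits, groups, 2, "woman", "man")
print(f"probe rate: {overall:.2f}, woman minus man gap: {gap:.2f}")
```

In a real test the score matrix would come from a model such as CLIP scoring each image against the class names plus the probe. The gap mirrors the woman-minus-man difference described in the Figure 3 caption.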

Paper Structure (24 sections, 5 equations, 5 figures, 10 tables, 2 algorithms)

Introduction
Why harmful
How to test
Who is harmed
What can be done
Results and Discussion
Results of Single Bias Test
Results of Mixed Bias Test
Bias Mitigation with Adaptive Logit Adjustment
Conclusion
Method
Data Preparation
Probes Design and TriProTesting
Models Tested
Evaluation Metrics
...and 9 more sections

Figures (5)

Figure 1: Framework of bias analysis and mitigation in FMs. A, Illustration of probe testing. The input image depicts a chef, and the output includes two scenarios: unbiased prediction ("chef") and biased prediction triggered by a negative probe ("criminal"). This highlights the model's potential bias toward specific social groups. B, Three types of probes: Negative Probes, Positive Probes, and Neutral Probes. C, Datasets used for Single Bias Test: CelebA, UTKFace, FairFace, IdenProf. D, Extended datasets used for Mixed Bias Test: UTKFACE, FAIRFACE, IDENPROF, with additional labels (e.g., gender) to facilitate analysis of bias interactions across multiple social attributes. E, Comparison of the standard prediction process and the prediction process with logit adjustment, illustrating how logit adjustment redistributes probability power across categories to mitigate bias.
Figure 2: Bias analysis of FMs using Single Bias Test. A-D, FMs' prediction accuracy with probes included (bubble size) and the probability of being predicted as a probe (bubble color). E-H, The average probabilities of being predicted as probes for three models (CLIP, ALIGN, BridgeTower) across datasets. I, The average probabilities of being predicted as different types of probes for three models (CLIP, ALIGN, BridgeTower) across datasets. J, The top two largest probes predicted by OWLv2 for different classes.
Figure 3: Bias analysis of FMs using Mixed Bias Test. A-D, Mixed bias distributions for four FMs (CLIP, ALIGN, BridgeTower, OWLv2) across three extended datasets (UTKFACE, FAIRFACE, IDENPROF). The heatmap values represent the probability of "woman" groups being predicted as a probe minus the same probability for "man" groups. A positive value indicates that the "woman" group is more likely to be predicted as the corresponding probe, while a negative value indicates the opposite. E, Average probabilities of social subgroups (e.g., white_man, police_woman) being predicted as different types of probes (Negative, Neutral, Positive) for three models (CLIP, ALIGN, BridgeTower).
Figure 4: Performance improvement of FMs with Adaptive Logit Adjustment (AdaLogAdjustment). A-D, Results of Single Bias Test, showing improvements in macro average accuracy for four FMs (CLIP, ALIGN, BridgeTower, OWLv2) across four datasets (CelebA, UTKFace, FairFace, IdenProf). E-H, Results of Mixed Bias Test, showing improvements in macro average accuracy for four FMs across three extended datasets (UTKFACE, FAIRFACE, IDENPROF).
Figure S1: Improved macro average accuracy through AdaLogAdjustment under different learning rates. Improved macro average accuracy is shown for four representative test scenarios: 1) testing UTKFace with the probe "fraudster" using CLIP, 2) testing FairFace with the probe "person" using ALIGN, 3) testing CelebA with the probe "leader" using BridgeTower, and 4) testing IdenProf with the probe "liar" using OWLv2. Learning rates tested include $0.0001$, $0.001$, $0.01$, $0.05$, $0.1$, and $0.5$.
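The captions for Figure 4 and Figure S1 report macro average accuracy before and after logit adjustment at several learning rates. The sketch below illustrates that kind of measurement on toy data. It is a generic stand-in, not the paper's AdaLogAdjustment algorithm, which is not specified in this excerpt. The assumptions: one additive offset per label, fitted by gradient descent on cross-entropy over a small labeled calibration set. The function names, the toy scores, and the use of ground-truth labels are all hypothetical.

```python
import numpy as np

def macro_average_accuracy(y_true, y_pred, num_classes):
    # Mean of per-class accuracy (recall) over the classes present in y_true.
    per_class = [np.mean(y_pred[y_true == c] == c)
                 for c in range(num_classes) if np.any(y_true == c)]
    return float(np.mean(per_class))

def fit_logit_offsets(logits, y_true, lr=0.1, steps=200):
    # Assumption: a generic post-hoc adjustment that learns one additive
    # offset per label; the model itself is never retrained.
    num_labels = logits.shape[1]
    offsets = np.zeros(num_labels)
    onehot = np.eye(num_labels)[y_true]
    for _ in range(steps):
        z = logits + offsets
        z = z - z.max(axis=1, keepdims=True)
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        # Gradient of mean cross-entropy with respect to the offsets.
        grad = (p - onehot).mean(axis=0)
        offsets -= lr * grad
    return offsets

# Hypothetical scores; columns are ["chef", "doctor", "criminal"],
# where column 2 is a probe that is never a correct label.
logits = np.array([
    [2.0, 0.5, 0.1],
    [0.3, 0.2, 1.5],  # true class chef, pulled to the probe
    [1.0, 2.0, 0.0],
    [0.4, 0.9, 1.2],  # true class doctor, pulled to the probe
])
y_true = np.array([0, 0, 1, 1])

before = macro_average_accuracy(y_true, logits.argmax(axis=1), num_classes=2)
offsets = fit_logit_offsets(logits, y_true, lr=0.1)
adjusted_pred = (logits + offsets).argmax(axis=1)
after = macro_average_accuracy(y_true, adjusted_pred, num_classes=2)
print(f"macro average accuracy before: {before:.2f}, after: {after:.2f}")
```

Because the probe is never a correct label, its offset is pushed down at every step, which moves probability mass back toward the real classes. Rerunning `fit_logit_offsets` with the learning rates listed in the Figure S1 caption would give a comparable sweep on this toy data.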
