Examining the Robustness of Homogeneity Bias to Hyperparameter Adjustments in GPT-4

Messi H. J. Lee

TL;DR

This study investigates homogeneity bias in Vision-Language Models, focusing on GPT-4o mini outputs generated in response to GANFD face images that signal group identity and evaluated via cosine similarity of sentence embeddings. Using a controlled experimental setup and mixed-effects models, we show that homogeneity bias largely persists across a range of hyperparameter values (sampling temperature and top p) and responds non-linearly to them, with racial bias and gender bias responding differently to parameter changes. While certain hyperparameter adjustments can mitigate racial bias to some extent, they do not provide a universal solution, and the differential responses across social dimensions underscore the need for comprehensive bias-mitigation strategies beyond tuning alone. The findings carry implications for practitioners and point to future work on non-linear parameter spaces, open-source model analysis, and broader task contexts to better understand and address homogeneity bias in AI systems.
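As a rough illustration of the similarity measure named above, the sketch below computes the mean pairwise cosine similarity within a group of embedding vectors. It is a minimal sketch under stated assumptions, not the study's pipeline: the function name `mean_pairwise_cosine` and the toy vectors are hypothetical, and real inputs would be sentence embeddings of the generated stories.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of rows."""
    # Normalize rows to unit length so dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit.T
    # Strict upper triangle: each pair counted once, self-similarity excluded.
    i, j = np.triu_indices(len(unit), k=1)
    return float(sims[i, j].mean())

# Hypothetical toy "embeddings" standing in for story embeddings of two groups.
rng = np.random.default_rng(0)
base = np.ones(8)
group_a = base + 0.1 * rng.normal(size=(5, 8))  # tightly clustered stories
group_b = base + 1.0 * rng.normal(size=(5, 8))  # more varied stories

score_a = mean_pairwise_cosine(group_a)
score_b = mean_pairwise_cosine(group_b)
print(f"group A: {score_a:.3f}, group B: {score_b:.3f}")
```

Under this measure, a higher score for one group means its stories are more alike, which is the sense of "more homogeneous" used here.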

Abstract

Vision-Language Models trained on massive collections of human-generated data often reproduce and amplify societal stereotypes. One critical form of stereotyping reproduced by these models is homogeneity bias, the tendency to represent certain groups as more homogeneous than others. We investigate how this bias responds to hyperparameter adjustments in GPT-4, specifically examining sampling temperature and top p, which control the randomness of model outputs. By generating stories about individuals from different racial and gender groups and comparing their similarities using vector representations, we assess both bias robustness and its relationship with hyperparameter values. We find that (1) homogeneity bias persists across most hyperparameter configurations, with Black Americans and women being represented more homogeneously than White Americans and men, (2) the relationship between hyperparameters and group representations shows unexpected non-linear patterns, particularly at extreme values, and (3) hyperparameter adjustments affect racial and gender homogeneity bias differently: while increasing temperature or decreasing top p can reduce racial homogeneity bias, these changes show different effects on gender homogeneity bias. Our findings suggest that while hyperparameter tuning may mitigate certain biases to some extent, it cannot serve as a universal solution for addressing homogeneity bias across different social group dimensions.
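To make the two hyperparameters concrete, the following sketch applies the textbook definitions of temperature scaling and top p (nucleus) truncation to a toy vector of logits. It is an illustration only: the function name `temperature_top_p_probs` and the logits are hypothetical, and how GPT-4 applies these settings internally is not described in this document.

```python
import numpy as np

def temperature_top_p_probs(logits, temperature=1.0, top_p=1.0):
    """Next-token distribution after temperature scaling and top-p truncation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Sort tokens from most to least probable and accumulate their mass.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
print(temperature_top_p_probs(toy_logits, temperature=1.0, top_p=1.0))
print(temperature_top_p_probs(toy_logits, temperature=2.0, top_p=1.0))
print(temperature_top_p_probs(toy_logits, temperature=1.0, top_p=0.5))
```

In this toy setting, raising the temperature flattens the distribution, while lowering top p discards low-probability tokens, so the two settings change output randomness in different ways.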

Paper Structure (26 sections, 4 equations, 6 figures, 5 tables)

Introduction
Perceived Variability
Homogeneity Bias in Artificial Intelligence
This Work
Method
Signaling Group Identity
Selection of Vision-Language Models
Writing Prompt
First hyperparameter: sampling temperature
Second hyperparameter: top p
Homogeneity bias
Results
Sampling temperature has a non-linear effect on homogeneity of group representations
Homogeneity bias is robust to temperature
Temperature effects on racial and gender homogeneity bias
...and 11 more sections

Figures (6)

Figure 1: A three-step visualization of the study design, illustrating how the top p hyperparameter is adjusted to evaluate racial homogeneity bias across a range of hyperparameter values (i.e., 0.8 and 0.4).
Figure 2: Sample of facial stimuli from GANFD used to represent the four groups covered in this work.
Figure 3: Cosine similarity values of racial and gender groups across temperature values. Higher cosine similarity means more homogeneity in the stories generated for that group. Error bars represent one standard error above and below the mean.
Figure 4: Standardized cosine similarity values of racial and gender groups across temperature values. As these measurements are standardized across groups, they are meant to visualize the magnitude of homogeneity bias across temperature values. Error bars represent one standard error above and below the mean.
Figure 5: Cosine similarity values of racial and gender groups across top p values. Higher cosine similarity means more homogeneity in the stories generated for that group. Error bars represent one standard error above and below the mean.
...and 1 more figure
