Are Models Biased on Text without Gender-related Language?

Catarina G Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh

TL;DR

The paper introduces UnStereoEval (USE) to probe gender bias in stereotype-free sentences, challenging the assumption that bias arises solely from gendered words in training data. By defining word-gender correlations via PMI and enforcing minimal co-occurrences to build non-stereotypical benchmarks, the authors evaluate 28 LMs on USE-5/10/20 and on filtered versions of the Winobias (WB) and Winogender (WG) datasets. Across all models, fairness remains low (models behave fairly on only 9-41% of sentences), with a consistent male preference in stereotype-free benchmarks, suggesting that bias arises from deeper model behaviors beyond explicit gender cues. The work provides a systematic, reproducible evaluation framework and data release to guide future bias mitigation and safer deployment of language models.

Abstract

Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations that are present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine whether the sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at https://ucinlp.github.io/unstereo-eval.
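
To make the "fair behavior in only 9%-41% of sentences" figure more concrete, here is a minimal sketch of how such a percentage could be computed from per-pair model scores. It assumes, without confirmation from the text above, that a pair counts as fair when the natural-log probabilities of the feminine and masculine variants differ by at most a tolerance. The function name `fairness_report`, the tolerance `eps`, and the toy numbers are all hypothetical.

```python
def fairness_report(pairs, eps=0.5):
    """Summarize model preferences over (log p(s_F), log p(s_M)) pairs.

    Assumption (not taken from the source): a pair is "fair" when the
    absolute log-probability gap is at most `eps`.
    """
    if not pairs:
        raise ValueError("need at least one sentence pair")
    fair = male = female = 0
    for logp_f, logp_m in pairs:
        gap = logp_f - logp_m
        if abs(gap) <= eps:
            fair += 1          # model is roughly indifferent
        elif gap < 0:
            male += 1          # masculine variant is more likely
        else:
            female += 1        # feminine variant is more likely
    n = len(pairs)
    return {
        "fair": fair / n,
        "prefers_male": male / n,
        "prefers_female": female / n,
    }


# Toy log-probabilities, invented for illustration only.
toy_pairs = [(-10.0, -10.2), (-12.0, -9.0), (-8.0, -8.1), (-11.0, -7.5)]
report = fairness_report(toy_pairs, eps=0.5)
print(report)
```

On these toy numbers, two of the four pairs fall within the tolerance and the other two favor the masculine variant. The choice of `eps` directly controls how strict the fairness criterion is, so any reported percentage is only meaningful alongside the threshold used.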

Paper Structure (26 sections, 4 equations, 6 figures, 22 tables)

Figures (6)

  • Figure 1: Preferences of 28 LMs for three non-stereotypical sentence pairs. Although each sentence is grammatically and semantically correct under both the masculine ($s_M$) and feminine ($s_F$) completions and is free of words with strong gender connotations, the majority of LMs assign more probability mass to one completion than to the other.
  • Figure 2: Percentage of examples remaining after enforcing the gender co-occurrence constraint across 5 datasets (i.e., $|\mathrm{MaxPMI}({\mathbf{s}})| \leq \eta$). When $\eta=0.5$, three datasets preserve less than $35\%$ of their original sentences.
  • Figure 3: Overview of the pipeline for generating non-stereotypical benchmarks: 1) Word selection stage chooses seed words using a PMI-based score to guide sentence generation; 2) Sentence pairs generation stage produces sentences for each (gender, seed word) pair, followed by the creation of the opposite gender variant, and subsequent removal of unnatural pairs or any pair containing gender co-occurring words (operationalized as $|\mathrm{MaxPMI}({\mathbf{s}})| \leq \eta$).
  • Figure 4: Word-level distributions of $\mathrm{PMI}(w, \text{'she'})$ and $\mathrm{PMI}(w, \text{'he'})$ in PILE. The joint distribution is defined for words that co-occur with both "she" and "he". The fraction of well-defined functions is smaller for female pronouns.
  • Figure 5: Kendall Tau correlation coefficients for various parameterizations of $\delta(w)$. With the exception of relationship-specific or the pair "mummy"-"daddy", most parameterizations correlate positively with the original $\delta(w)$ definition.
  • ...and 1 more figures
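
The filter $|\mathrm{MaxPMI}({\mathbf{s}})| \leq \eta$ referenced in Figures 2 and 3 can be illustrated with a small sketch. The text above does not spell out the definitions, so the code assumes that $\delta(w) = \mathrm{PMI}(w, \text{'she'}) - \mathrm{PMI}(w, \text{'he'})$ and that $\mathrm{MaxPMI}({\mathbf{s}})$ is the $\delta(w)$ of the word in the sentence with the largest absolute value. Skipping words whose PMI is undefined is a further assumption, and the counts are invented toy values, not PILE statistics.

```python
import math

# Toy corpus statistics (hypothetical, for illustration only).
TOTAL = 1000
UNIGRAM = {"she": 100, "he": 100, "nurse": 10, "engineer": 10, "table": 20}
PAIR = {
    ("nurse", "she"): 4, ("nurse", "he"): 1,
    ("engineer", "she"): 1, ("engineer", "he"): 4,
    ("table", "she"): 2, ("table", "he"): 2,
}


def pmi(word, pronoun):
    """PMI(w, g) = log(p(w, g) / (p(w) * p(g))); None when undefined."""
    joint = PAIR.get((word, pronoun), 0)
    c_w, c_g = UNIGRAM.get(word, 0), UNIGRAM.get(pronoun, 0)
    if joint == 0 or c_w == 0 or c_g == 0:
        return None
    return math.log(joint * TOTAL / (c_w * c_g))


def delta(word):
    """Assumed word-level score: PMI(w, 'she') - PMI(w, 'he')."""
    p_she, p_he = pmi(word, "she"), pmi(word, "he")
    if p_she is None or p_he is None:
        return None
    return p_she - p_he


def max_pmi(tokens):
    """delta(w) of the token with the largest |delta(w)|; 0.0 if none is defined."""
    scores = [d for d in (delta(t) for t in tokens) if d is not None]
    return max(scores, key=abs) if scores else 0.0


def is_non_stereotypical(tokens, eta):
    """Keep a sentence only if |MaxPMI(s)| <= eta."""
    return abs(max_pmi(tokens)) <= eta


print(is_non_stereotypical(["the", "table"], eta=0.5))           # kept
print(is_non_stereotypical(["the", "nurse", "table"], eta=0.5))  # filtered out
```

Smaller values of `eta` give a stricter filter, which is consistent with Figure 2's observation that few sentences survive at $\eta=0.5$.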