Failure of contextual invariance in gender inference with large language models

Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

Abstract

Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19–52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

Paper Structure (18 sections, 5 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Discourse context destabilises pronoun generation probabilities across all models. (a-e) Distribution, over all templates, of empirical generation probabilities. The height of the curve at the left extreme of each plot indicates the fraction of templates for which, in that context setting, $\hat{p}(f|c) = 0$ (the model always generates the masculine pronoun in context setting $c$). Conversely, the height at the right extreme indicates the fraction of templates for which $\hat{p}(f|c) = 1$ (the model always generates the feminine pronoun). (f-j) Average Kullback-Leibler divergence, $\langle \mathrm{KL}\left(c\,\|\,c_\emptyset\right)\rangle$ (in bits), between the distribution of responses for each template in the unprimed setting, $c_\emptyset$, and each of the primed settings tested, $c \in \{c_f, c_m, c_{0,1}, c_{0,2}\}$.
  • Figure 2: Correlations with cultural stereotypes vanish once discourse context is introduced. Spearman correlation between the cultural gender stereotypes as measured in Ref. misersky_norms_2014 and the empirical generation probabilities measured in our experiment in the unprimed ($c_\emptyset$), masculine-primed ($c_m$), and feminine-primed ($c_f$) context settings, with significance levels ($*** := p<.001$, $** := p<.01$, $* := p<.05$).
  • Figure 3: Discourse context shifts predictive dominance towards the priming pronoun, away from cultural and syntactic features. Average mutual information (in bits) between features of the prompt and the pronoun generated by the model, in unprimed and primed settings, shown with standard error. Prompt features: 'Role Type' refers to the type of role of the antecedent of the target pronoun, 'Stereotype' refers to cultural stereotypes about the antecedent, 'Case' refers to the grammatical case of the pronoun, 'Order' refers to the linear order of options presented in the prompt, and 'Priming Pronoun' refers to the gender of the priming pronoun (present only in the primed settings).
  • Figure 4: Contextual effects vary across models and are sensitive to the linear order of pronoun options. (a) Fraction of template pairs that exhibited contextual measurements, broken down by the linear order of pronoun options presented in the prompt. (b) Number of template pairs that exhibited contextual measurements in both members of each model pair.
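
The two information-theoretic quantities named in the captions for Figures 1 and 3 can be illustrated with a minimal sketch. It assumes each template's responses reduce to a binary outcome (feminine vs. masculine pronoun), so that $\hat{p}(f|c)$ defines a Bernoulli distribution. The function names, the clipping constant `eps` and the plug-in mutual information estimator are illustrative assumptions; the estimators and any smoothing actually used in the paper are not stated in this summary.

```python
import math
from collections import Counter


def bernoulli_kl_bits(p, q, eps=1e-6):
    """KL(Bern(p) || Bern(q)) in bits.

    p: empirical probability of the feminine pronoun in a primed setting c.
    q: the same probability in the unprimed setting c_empty.
    eps: clipping constant (an assumption, not from the paper) that keeps
         the divergence finite when a probability is exactly 0 or 1.
    """
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))


def mean_kl_bits(primed, unprimed, eps=1e-6):
    """Average the per-template KL divergence over aligned templates."""
    if len(primed) != len(unprimed) or not primed:
        raise ValueError("need two non-empty, equally long sequences")
    total = sum(bernoulli_kl_bits(p, q, eps) for p, q in zip(primed, unprimed))
    return total / len(primed)


def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples.

    xs: values of one prompt feature (e.g. the priming pronoun's gender).
    ys: the pronoun generated by the model for the same prompts.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty, equally long sequences")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # (c/n) * log2( (c/n) / ((px/n) * (py/n)) ) simplifies to the term below.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


if __name__ == "__main__":
    # Hypothetical per-template probabilities, for illustration only.
    p_primed = [0.9, 0.2, 0.5]
    p_unprimed = [0.6, 0.4, 0.5]
    print(mean_kl_bits(p_primed, p_unprimed))

    # Hypothetical prompt feature and generated pronouns.
    priming = ["f", "f", "m", "m"]
    generated = ["f", "f", "m", "f"]
    print(mutual_information_bits(priming, generated))
```

The direction of the divergence follows the caption's notation $\mathrm{KL}(c\,\|\,c_\emptyset)$, with the primed distribution as the first argument. Without clipping, the divergence is infinite whenever the unprimed probability is exactly 0 or 1 and the primed one is not, which matters here because Figure 1 reports templates with $\hat{p}(f|c) = 0$ or $1$.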