Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution

Emily McMilin

TL;DR

The paper tackles underspecification in language modeling by proposing a simple causality-informed mechanism in which incomplete task specification induces latent selection bias, leading to spurious gender biases such as time- and location-associated pronoun preferences. It introduces two lightweight black-box evaluation methods: (1) measuring correlations between injected time/location cues and gendered pronoun predictions (Method 1), and (2) a specification-detection metric that flags unspecified tasks at inference (Method 2). Across a broad spectrum of models from BERT-base to GPT-4 Turbo Preview, the study finds that model size has limited impact on these specification-induced correlations, while post-training objectives like SFT and RLHF have larger effects. The work provides open-source code and demonstrations, enabling practitioners to detect and potentially mitigate specification-induced biases in real-world deployments. All mathematical relations are expressed using formal causal notation to ground the analysis, including $P(Y|X)$, $P(Y|X,S)$, $X \not\rightarrow Y$, and the conditioning behavior $(S \perp\!\perp Y | X)_{G_S}$ in the presence of selection mechanisms.
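
As a rough illustration of Method 1, the sketch below injects a range of years into a text, records the female and male pronoun probabilities for each year, and compares the two linear-fit slopes. Everything named here is an assumption made for illustration: the template, the year range, and especially `toy_pronoun_probs`, which is a synthetic stand-in for a real model call and does not reflect any measured result. The paper averages over many evaluation texts per injected value; this sketch uses a single template.

```python
import re
from statistics import mean

# Hypothetical template; the paper's evaluation texts are not reproduced here.
TEMPLATE = "In {year}, [MASK] worked as a doctor."
YEARS = list(range(1900, 2001, 20))


def toy_pronoun_probs(text):
    """Synthetic stand-in for a model call; returns (p_female, p_male).

    The linear drift with year is made up so the example has something to fit.
    """
    year = int(re.search(r"\d{4}", text).group())
    drift = 0.002 * (year - 1900)
    return 0.2 + drift, 0.7 - drift


def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_gap(score_fn, template, years):
    """Female slope minus male slope across the injected years."""
    probs = [score_fn(template.format(year=y)) for y in years]
    female = [p[0] for p in probs]
    male = [p[1] for p in probs]
    return least_squares_slope(years, female) - least_squares_slope(years, male)


gap = slope_gap(toy_pronoun_probs, TEMPLATE, YEARS)
print(f"female minus male slope: {gap:.4f} per year")
```

To try this against a real model, `toy_pronoun_probs` would be replaced by a function that returns the model's probabilities for the gendered pronouns at the masked or prompted position. A slope gap near zero would suggest little gender vs. time correlation for that text.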

Abstract

Modern language modeling tasks are often underspecified: for a given token prediction, many words may satisfy the user's intent of producing natural language at inference time; however, only one word will minimize the task's loss function at training time. We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations. Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods that we apply to gendered pronoun resolution tasks on a wide range of LLMs to 1) aid in the detection of inference-time task underspecification by exploiting 2) previously unreported gender vs. time and gender vs. location spurious correlations on LLMs with a range of A) sizes: from BERT-base to GPT-4 Turbo Preview, B) pre-training objectives: from masked & autoregressive language modeling to a mixture of these objectives, and C) training stages: from pre-training only to reinforcement learning from human feedback (RLHF). Code and open-source demos are available at https://github.com/2dot71mily/uspec.
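
The detection idea in point 1) can be sketched as a simple threshold test. This is not the paper's Task Specification Metric, whose exact definition is not given in this summary. It is a minimal stand-in that assumes a text is more likely unspecified when its female-pronoun probability moves a lot as the injected time or location cue changes. The function names, the use of a max-minus-min spread, the threshold value, and the example probabilities are all hypothetical.

```python
def spread_across_cues(female_probs):
    """Range of the female-pronoun probability over the injected cue values."""
    return max(female_probs) - min(female_probs)


def flag_unspecified(female_probs, threshold=0.05):
    """Flag a text as likely unspecified if its spread exceeds the threshold.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    return spread_across_cues(female_probs) > threshold


# Synthetic probabilities, one value per injected date; not measured results.
examples = {
    "pronoun resolved by context (assumed well-specified)": [0.91, 0.92, 0.90, 0.91],
    "pronoun not resolved by context (assumed unspecified)": [0.30, 0.38, 0.47, 0.55],
}

for label, probs in examples.items():
    print(label, "->", "unspecified" if flag_unspecified(probs) else "well-specified")
```

In practice the threshold would need to be chosen on texts with known labels, and a slope-based or normalized score might be preferable to the raw spread used here.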

Paper Structure (28 sections, 1 equation, 15 figures, 3 tables)

Introduction
Related Work
Contributions
Background: Selection Bias
Problem Settings
Illustrative Toy Task
Toy Data Structural Causal Model
Gendered Pronoun Resolution Task
Method 1 Measuring Correlations
Method 1 Experimental Setup
Method 1 Results and Discussion
Method 2 Specification Detection
Method 2 Experimental Setup
Method 2 Results and Discussion
Conclusion
...and 13 more sections

Figures (15)

Figure 1: Causal DAGs for which the prediction could be 'right for the wrong reasons' as related to task specification: (a) is well-specified, yet the model mostly relies on gender-occupation shortcut features; (b) through (d) are increasingly underspecified, with $X$ lacking any causal features for $Y$; where $X$ & $Y$ are the dataset's text-based features & labels, $B$ & $G$ are common causes of $X$ & $Y$: one a shortcut and one intended, and $W$ & $S$ are not causes of $Y$, but included due to their involvement in sample selection bias, $S\!$.
Figure 2: Graphs (a) and (b) show DAGs for (a) well-specified ($X \!\rightarrow\! Y$) and (b) unspecified ($X \!\not\!\rightarrow\! Y$) tasks. Plots (c) and (d) show the statistical relationships entailed by DAGs (a) and (b), when instantiated with the SCM defined in Equations \ref{eq10} to \ref{eq14}, with three notable effects: 1) 'latent' sample selection bias: uncorrelated $W$ vs. $G$ in (i) become correlated in (ii) for both sampled well-specified and unspecified tasks; 2) specification-induced bias on well-specified tasks: the sampled well-specified $X$ vs. $Y$ correlation in (c)(iv) is largely unaffected by the latent $W$ vs. $G$ sample selection bias; 3) specification-induced bias on unspecified tasks: the sampled unspecified $X$ vs. $Y$ correlation in (d)(iv) is greatly affected by the latent $W$ vs. $G$ sample selection bias.
Figure 3: Evaluation of LLMs for latent gender vs. time and gender vs. location spurious correlations using the Masked Gender Task (MGC) evaluation set (see Table \ref{tab:input-text}). Models with MLM-like objectives (e.g. BERT and RoBERTa) use the MGC text alone. For models with an autoregressive LM objective (e.g. GPT-family), each MGC text is wrapped in simple instruction prompts, established prior to GPT-4 access (see Section \ref{method-1-experimental-setup}). Fig (a) shows the unnormalized softmax probabilities for predicted gendered pronouns, with each plotted dot representing the softmax probability for a given gendered prediction, $G$, averaged over the 60 texts injected with a given time or location value for $W\!$ (see more details in Section \ref{gendered-calc}). The shaded regions show the 95% confidence interval for the linear fit. Fig (b) plots LLM parameter count vs. the average difference between the female and male linear-fit slopes from fig (a) for all prompts, with marker size scaling with the magnitude of the averaged squared Pearson correlation coefficient, $r^2$.
Figure 4: Softmax probabilities from RoBERTa-large for predicted female pronouns, normalized over all gendered predictions, vs. a range of dates (injected into the text), for 'Doctor' Winogender texts, listed in Table \ref{tabWinogender}.
Figure 5: Task Specification Metric results from GPT-3.5 SFT and GPT-4 Turbo Preview on the Winogender and Winogender-Simplified benchmarks. This method exploits our finding that well-specified texts are less likely to exhibit specification-induced spurious correlations. 'Well-specified' texts are marked with a blue horizontal or vertical bar. The remaining texts have a ground truth label of 'unspecified'. Perfect detection would appear as a horizontal row of blue 'plus' symbols (composed of the markers from both well-specified texts) below some thresholding line, with all the green markers above. See example Winogender input texts in Table \ref{tabWinogender}, and example Winogender-Simplified input texts in Section \ref{simplified}.
...and 10 more figures
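
The three effects listed in the Figure 2 caption can be reproduced qualitatively with a small simulation. The structural equations below are assumptions chosen for illustration, not the paper's SCM: $W$ and $G$ are independent, $X$ depends on $W$, $Y$ depends on $G$ (plus `beta * X` in the well-specified case), and a sample is selected when $W + G > 0$. The coefficients, noise scales, and selection rule are all hypothetical, and the sign of the induced correlation depends on that selection rule.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=20000, beta=0.0, seed=0):
    """Draw (w, g, x, y, selected) rows from the assumed toy SCM.

    beta = 0 gives the unspecified task (X does not cause Y);
    beta > 0 gives a well-specified task (X causes Y).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.gauss(0, 1)                    # e.g. time or location
        g = rng.gauss(0, 1)                    # e.g. gender
        x = w + 0.5 * rng.gauss(0, 1)          # text features depend on W
        y = beta * x + g + 0.5 * rng.gauss(0, 1)
        selected = (w + g) > 0                 # selection depends on W and G
        rows.append((w, g, x, y, selected))
    return rows


def summarize(rows):
    """Correlations in the full population and in the selected sample."""
    sampled = [r for r in rows if r[4]]
    col = lambda data, i: [r[i] for r in data]
    return {
        "wg_all": pearson_r(col(rows, 0), col(rows, 1)),
        "wg_sampled": pearson_r(col(sampled, 0), col(sampled, 1)),
        "xy_all": pearson_r(col(rows, 2), col(rows, 3)),
        "xy_sampled": pearson_r(col(sampled, 2), col(sampled, 3)),
    }


for name, beta in [("unspecified", 0.0), ("well-specified", 2.0)]:
    stats = summarize(simulate(beta=beta))
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Under these assumptions, selection makes $W$ and $G$ correlated in both settings, but the $X$ vs. $Y$ correlation shifts noticeably only when `beta` is zero, because a strong direct $X \rightarrow Y$ effect dominates the smaller selection-induced term.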
