Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution
Emily McMilin
TL;DR
The paper addresses underspecification in language modeling by proposing a simple causality-informed mechanism: incomplete task specification induces latent selection bias, which produces spurious correlations such as gendered pronoun preferences that vary with time and location cues. It introduces two lightweight black-box evaluation methods: (1) measuring correlations between injected time/location cues and gendered pronoun predictions (Method 1), and (2) a specification-detection metric that flags underspecified tasks at inference time (Method 2). Across models ranging from BERT-base to GPT-4 Turbo Preview, the study finds that model size has limited impact on these specification-induced correlations, while post-training methods such as SFT and RLHF have larger effects. The work provides open-source code and demos, enabling practitioners to detect and potentially mitigate specification-induced biases in real-world deployments. The analysis is grounded in formal causal notation, including $P(Y|X)$, $P(Y|X,S)$, $X \not\rightarrow Y$, and the conditional independence statement $(S \perp\!\perp Y | X)_{G_S}$ in the presence of selection mechanisms.
Abstract
Modern language modeling tasks are often underspecified: for a given token prediction, many words may satisfy the user's intent of producing natural language at inference time; however, only one word will minimize the task's loss function at training time. We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations. Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods that we apply to gendered pronoun resolution tasks on a wide range of LLMs to 1) aid in the detection of inference-time task underspecification by exploiting 2) previously unreported gender vs. time and gender vs. location spurious correlations on LLMs with a range of A) sizes: from BERT-base to GPT-4 Turbo Preview, B) pre-training objectives: from masked & autoregressive language modeling to a mixture of these objectives, and C) training stages: from pre-training only to reinforcement learning from human feedback (RLHF). Code and open-source demos are available at https://github.com/2dot71mily/uspec.
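
To make the gender vs. time correlation measurement concrete, the following is a minimal sketch and not the released implementation. It injects a year into a text template, asks a black-box scorer for the probabilities of "she" and "he" at the masked pronoun position, and computes the Pearson correlation between the year and the normalized female share. The template, the function names, and `toy_scorer` (a stand-in for a real model) are all hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical template: a date cue is injected and the pronoun is masked.
TEMPLATE = "In {year}, [MASK] was a teenager."


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def female_share(probs: Dict[str, float]) -> float:
    """Probability of 'she' normalized over the two pronouns 'she' and 'he'."""
    p_she = probs.get("she", 0.0)
    p_he = probs.get("he", 0.0)
    total = p_she + p_he
    return p_she / total if total > 0 else 0.5


def gender_vs_cue_correlation(
    score: Callable[[str], Dict[str, float]],
    years: Sequence[int],
    template: str = TEMPLATE,
) -> float:
    """Correlate the injected year with the model's normalized female share."""
    shares = [female_share(score(template.format(year=y))) for y in years]
    return pearson(list(years), shares)


def toy_scorer(text: str) -> Dict[str, float]:
    """Stand-in for a real model: 'she' probability drifts upward with the year."""
    year = int(text.split()[1].rstrip(","))
    p_she = 0.3 + 0.002 * (year - 1900)
    return {"she": p_she, "he": 1.0 - p_she}


years = list(range(1900, 2001, 20))
r = gender_vs_cue_correlation(toy_scorer, years)
print(f"Pearson r between year and female share: {r:.3f}")
```

In practice, `toy_scorer` would be replaced by a wrapper around the model under test that returns pronoun probabilities for the masked position. A correlation near zero would suggest that the date cue has little influence on the pronoun prediction, while a large magnitude would indicate the kind of spurious dependence described above.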
