Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?

Vagrant Gautam, Eileen Bingert, Dawei Zhu, Anne Lauscher, Dietrich Klakow

TL;DR

This paper presents RUFF, a carefully designed dataset of over 5 million instances for measuring robust pronoun fidelity in English, and shows that pronoun fidelity is not robust, even in a simple, naturalistic setting where humans achieve nearly 100% accuracy.

Abstract

Robust, faithful and harm-free pronoun use for individuals is an important goal for language model development as the use of these models increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later. We present RUFF, a carefully designed dataset of over 5 million instances to measure robust pronoun fidelity in English, and we evaluate 37 model variants from nine popular families, across architectures (encoder-only, decoder-only and encoder-decoder) and scales (11M-70B parameters). When an individual is introduced with a pronoun, models can mostly faithfully reuse this pronoun in the next sentence, but they are significantly worse with she/her/her, singular they and neopronouns. Moreover, models are easily distracted by non-adversarial sentences discussing other people; even one sentence with a distractor pronoun causes accuracy to drop on average by 34 percentage points. Our results show that pronoun fidelity is not robust, in a simple, naturalistic setting where humans achieve nearly 100% accuracy. We encourage researchers to bridge the gaps we find and to carefully evaluate reasoning in settings where superficial repetition might inflate perceptions of model performance.
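
To make the task definition concrete, the following is a minimal Python sketch of what a single pronoun fidelity instance might look like: an introduction that pairs an entity with a pronoun, an optional distractor sentence about someone else, and a task sentence with a blank. The sentences, the pronoun inventory, the `Instance` class and the `build_instance` helper are all illustrative assumptions, not items or code from RUFF.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative pronoun sets (nominative, accusative, possessive).
# Assumption: this inventory is for demonstration only.
PRONOUNS = {
    "he": ("he", "him", "his"),
    "she": ("she", "her", "her"),
    "they": ("they", "them", "their"),
    "xe": ("xe", "xem", "xyr"),
}


@dataclass
class Instance:
    context: List[str]  # introduction plus optional distractor sentences
    task: str           # sentence with a blank where the pronoun is reused
    answer: str         # the form that stays faithful to the introduction


def build_instance(target: str, distractor: Optional[str] = None) -> Instance:
    """Assemble one hypothetical instance from hand-written templates."""
    if distractor == target:
        # The distractor must use a different pronoun set than the target.
        raise ValueError("distractor must use a different pronoun set")
    nom, _, poss = PRONOUNS[target]
    context = [f"The accountant had just eaten a big meal so {nom} felt sleepy."]
    if distractor is not None:
        d_nom = PRONOUNS[distractor][0]
        context.append(
            f"The taxpayer needed coffee because {d_nom} had been working all night."
        )
    task = "The accountant was asked about ___ charges for preparing tax returns."
    return Instance(context, task, poss)


if __name__ == "__main__":
    inst = build_instance("she", distractor="he")
    print(" ".join(inst.context), inst.task, "->", inst.answer)
```

In this sketch the correct answer depends only on the introduction, so adding the distractor sentence changes the context but not the expected pronoun.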

Paper Structure

This paper contains 34 sections, 1 equation, 13 figures, and 5 tables.

Figures (13)

  • Figure 1: We evaluate model accuracy at using the correct pronoun for an entity when provided with an explicit introduction and 0-5 non-adversarial distractor sentences. Llama-2-70B and RoBERTa-large show large accuracy drops with just one distractor. Accuracy is averaged over 3 data splits; standard deviation is shown with shading.
  • Figure 2: Template assembly for RUFF: occupation-specific task templates are matched with generic context templates (introductions and optional distractors) that are instantiated with disjoint pronoun sets. This creates realistic but controlled narratives that allow us to measure robust pronoun fidelity.
  • Figure 3: Model evaluation overview: pseudo log likelihoods (PLLs) and log likelihoods (LLs) of verbalized instances are used for encoder-only and decoder-only models; generations are used for chat models.
  • Figure 4: Counts of pronoun predictions from all models, in the absence of context. Error bars indicate standard deviation across models.
  • Figure 5: Pronoun fidelity by model with an introductory context. Accuracy is averaged across occupations, pronouns and grammatical cases, and is above chance (0.25) but below human performance (1.0).
  • ...and 8 more figures
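
The likelihood-based evaluation described in the captions for Figures 3 and 5 can be sketched roughly as follows: each candidate pronoun is filled into the blank, the verbalized instance is scored, and the highest-scoring candidate is taken as the prediction. With four candidates, chance accuracy is 0.25. This is a minimal sketch under stated assumptions: the candidate list, the example sentences, and the `predict`, `accuracy` and `toy_score` helpers are hypothetical, and `toy_score` is a repetition heuristic standing in for a real model's (pseudo) log likelihood.

```python
from typing import Callable, List, Tuple

# Assumption: four nominative candidates, which gives chance accuracy 0.25.
CANDIDATES = ["he", "she", "they", "xe"]


def predict(template: str, score: Callable[[str], float]) -> str:
    """Fill the blank with each candidate and return the best-scoring one.

    Ties are broken by the order of CANDIDATES, because max() keeps the
    first maximal element it sees.
    """
    return max(CANDIDATES, key=lambda p: score(template.replace("___", p)))


def accuracy(examples: List[Tuple[str, str]], score: Callable[[str], float]) -> float:
    """Fraction of (template, gold pronoun) pairs predicted correctly."""
    correct = sum(predict(template, score) == gold for template, gold in examples)
    return correct / len(examples)


def toy_score(text: str) -> float:
    """Stand-in scorer, NOT a language model.

    It prefers texts with fewer distinct words, so it rewards repeating a
    pronoun that already appears in the context.
    """
    words = text.lower().replace(".", "").split()
    return -float(len(set(words)))


if __name__ == "__main__":
    no_distractor = "The nurse said she was tired. The nurse said ___ would rest."
    one_distractor = (
        "The nurse said she was tired. The doctor said he was late. "
        "The nurse said ___ would rest."
    )
    examples = [(no_distractor, "she"), (one_distractor, "she")]
    print(accuracy(examples, toy_score))
```

A scorer that only rewards repetition gets the first example right, but on the second it cannot separate "he" from "she" because both already appear in the context. That is the kind of superficial strategy the distractor sentences are meant to expose. A real evaluation would replace `toy_score` with model log likelihoods, or with generations for chat models.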