WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case

Vagrant Gautam, Julius Steuer, Eileen Bingert, Ray Johns, Anne Lauscher, Dietrich Klakow

TL;DR

This work evaluates two state-of-the-art supervised coreference resolution systems, SpanBERT, and five sizes of FLAN-T5, and proposes a new method to evaluate pronominal bias in coreference resolution that goes beyond the binary.

Abstract

While measuring bias and robustness in coreference resolution are important goals, such measurements are only as good as the tools we use to measure them. Winogender Schemas (Rudinger et al., 2018) are an influential dataset proposed to evaluate gender bias in coreference resolution, but a closer look reveals issues with the data that compromise its use for reliable evaluation, including treating different pronominal forms as equivalent, violations of template constraints, and typographical errors. We identify these issues and fix them, contributing a new dataset: WinoPron. Using WinoPron, we evaluate two state-of-the-art supervised coreference resolution systems, SpanBERT, and five sizes of FLAN-T5, and demonstrate that accusative pronouns are harder to resolve for all models. We also propose a new method to evaluate pronominal bias in coreference resolution that goes beyond the binary. With this method, we also show that bias characteristics vary not just across pronoun sets (e.g., he vs. she), but also across surface forms of those sets (e.g., him vs. his).

Paper Structure (45 sections, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Problems with Winogender Schemas that we fix in our new coreference resolution dataset, WinoPron. Correct antecedents appear in bold.
  • Figure 2: Winogender Schemas for cashier, customer and possessive pronouns, with the antecedent bolded.
  • Figure 3: Accuracy on WinoPron by case and pronoun series with supervised coreference resolution systems (CAW-coref and LingMess), and language models fine-tuned for coreference resolution (SpanBERT) and prompted zero-shot (FLAN-T5), compared to random performance (50%). Accusative pronoun performance is worse than other grammatical cases, and singular they and the neopronoun xe are challenging for several models.
  • Figure 4: Example groups for scoring consistency metrics using WinoPron templates for counselor, patient and possessive pronouns, with the antecedent bolded.
  • Figure 5: Percentage of model-attempted templates that show bias, for SpanBERT-base and SpanBERT-large.