Textual Entailment and Token Probability as Bias Evaluation Metrics

Virginia K. Felkner, Allison Lim, Jonathan May

TL;DR

The work investigates social bias in language models by comparing token probability (TP) based bias metrics with a novel NLI-based, midstream bias evaluation. By converting the WinoQueer TP dataset into WQ-NLI, the authors directly compare bias signals under the same bias definitions across nine models and three debiasing conditions, using eight aggregation metrics ($M_1$–$M_8$) to derive percentile bias scores. They find only weak correlations between TP and NLI metrics ($R^2$ up to $0.328$) and observe that NLI often detects underdebiased categories but yields brittle, inconsistent metrics across configurations. The study concludes that neither TP nor NLI is universally superior and recommends combining TP, NLI, and downstream bias evaluations to achieve comprehensive bias audits across models and use cases.

Abstract

Social bias in language models is typically measured with token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world language model use cases and harms. In this work, we test natural language inference (NLI) as a more realistic alternative bias metric. We show that, curiously, NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. We find that NLI metrics are more likely to detect "underdebiased" cases. However, NLI metrics seem to be more brittle and more sensitive to the wording of counterstereotypical sentences than TP approaches. We conclude that neither token probability nor natural language inference is a "better" bias metric in all cases, and we recommend a combination of TP, NLI, and downstream bias evaluations to ensure comprehensive evaluation of language models. Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes.
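As an illustration of what "low correlation" between two bias metrics means, the following minimal sketch computes the squared Pearson correlation ($R^2$) between two lists of per-model bias scores. It is not the authors' code: the function names `pearson_r` and `r_squared` and the example scores are hypothetical, and the paper's actual aggregation into percentile bias scores is not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a simple linear fit of ys on xs."""
    return pearson_r(xs, ys) ** 2


if __name__ == "__main__":
    # Hypothetical per-model bias scores (illustrative values, not from the paper).
    tp_scores = [50.0, 62.5, 71.0, 55.0, 80.0]
    nli_scores = [48.0, 70.0, 52.0, 66.0, 61.0]
    print(f"R^2 between TP and NLI scores: {r_squared(tp_scores, nli_scores):.3f}")
```

An $R^2$ near 1 would mean that one metric's score is almost a linear function of the other's, while a value near 0 means that knowing one score says little about the other.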

Paper Structure

This paper contains 20 sections, 1 equation, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Entailment and token probability bias scores for selected categories on three BERT Large Uncased models (undebiased, news-debiased, and twitter-debiased). These categories appear to be adequately debiased on the token probability metric, but the NLI metric shows lingering model bias.
  • Figure 2: Entailment and token probability bias scores for selected categories on three BERT Base Uncased models (undebiased, news-debiased, and twitter-debiased). While results are less pronounced than for BERT Large Uncased, NLI is still better at detecting underdebiased categories.
  • Figure 3: Entailment and token probability bias scores for selected categories on three BERT Base Cased models (undebiased, news-debiased, and twitter-debiased). Observed patterns are generally weaker but still present for cased BERT models.
  • Figure 4: Entailment and token probability bias scores for selected categories on three BERT Large Cased models (undebiased, news-debiased, and twitter-debiased). Observed patterns are generally weaker but still present for cased BERT models.
  • Figure 5: Entailment and token probability bias scores for selected categories on three RoBERTa Base models (undebiased, news-debiased, and twitter-debiased). For these models, NLI is clearly identifying lingering biases which TP scores do not.
  • ...and 4 more figures