UnQovering Stereotyping Biases via Underspecified Questions

Tao Li, Tushar Khot, Daniel Khashabi, Ashish Sabharwal, Vivek Srikumar

TL;DR

UnQover develops a general framework to expose stereotyping biases in QA systems by using underspecified questions that minimize factual grounding. It formalizes bias measurement by stripping away two confounds, positional dependence and attribute independence, yielding robust comparative and aggregate metrics across subjects, attributes, and templates. Across five transformer-based QA models, two QA datasets, and four bias classes, the study finds that larger models tend to harbor more bias, fine-tuning shifts bias in size- and data-dependent ways, and newer QA models show reduced bias relative to older counterparts. The work provides a principled evaluation toolkit for bias in QA and offers insights into model behavior and potential mitigation directions, while acknowledging Western-centric data limitations and binary gender simplifications.

Abstract

While language embeddings have been shown to have stereotyping biases, how these biases affect downstream question answering (QA) models remains unexplored. We present UNQOVER, a general framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors: positional dependence and question independence. We design a formalism that isolates the aforementioned errors. As case studies, we use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion. We probe five transformer-based QA models trained on two QA datasets, along with their underlying language models. Our broad study reveals that (1) all these models, with and without fine-tuning, have notable stereotyping biases in these classes; (2) larger models often have higher bias; and (3) the effect of fine-tuning on bias varies strongly with the dataset and the model size.
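To make the two reasoning errors concrete, the following is a minimal sketch of how a score can be made insensitive to subject order and to attribute negation. It is an illustration, not the paper's implementation: the function names, the toy scorers, and the example subjects and attributes are all hypothetical, and the exact metric definitions should be taken from the paper itself. The idea is to average a subject's score over both subject orders, subtract the same average for the negated attribute, and then compare the two subjects symmetrically.

```python
from typing import Callable, Tuple

# Hypothetical interface: score(candidate, subject_order, attribute) -> value in [0, 1].
# A real scorer would fill a template with the two subjects in the given
# order, ask a QA model the attribute question, and return the candidate's score.
Scorer = Callable[[str, Tuple[str, str], str], float]


def subject_bias(score: Scorer, x1: str, x2: str, attr: str, neg_attr: str) -> float:
    """Score of x1 averaged over both subject orders, minus the same for the negated attribute."""
    pos = 0.5 * (score(x1, (x1, x2), attr) + score(x1, (x2, x1), attr))
    neg = 0.5 * (score(x1, (x1, x2), neg_attr) + score(x1, (x2, x1), neg_attr))
    return pos - neg


def comparative_bias(score: Scorer, x1: str, x2: str, attr: str, neg_attr: str) -> float:
    """Symmetric comparison of the two subjects; stays in [-1, 1] when scores are in [0, 1]."""
    return 0.5 * (
        subject_bias(score, x1, x2, attr, neg_attr)
        - subject_bias(score, x2, x1, attr, neg_attr)
    )


# Toy scorers (assumptions for illustration only).
def positional(candidate: str, order: Tuple[str, str], attribute: str) -> float:
    # Always prefers whichever subject is mentioned first.
    return 0.9 if candidate == order[0] else 0.1


def attribute_blind(candidate: str, order: Tuple[str, str], attribute: str) -> float:
    # Always prefers the same subject, whatever the question asks.
    return 0.9 if candidate == "Gerald" else 0.1


def stereotyped(candidate: str, order: Tuple[str, str], attribute: str) -> float:
    # Associates one subject with the attribute and the other with its negation.
    matches = (candidate == "Gerald") == (attribute == "was a hunter")
    return 0.8 if matches else 0.2


args = ("Gerald", "Jennifer", "was a hunter", "was not a hunter")
print(comparative_bias(positional, *args))       # order preference cancels out
print(comparative_bias(attribute_blind, *args))  # negation-insensitive preference cancels out
print(comparative_bias(stereotyped, *args))      # a genuine association survives
```

Under these toy assumptions, the first two scorers yield a comparative score of zero because their preferences do not depend on the attribute being asked about, while the third yields a positive score of roughly 0.6 in favor of the first subject.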

Paper Structure

This paper contains 42 sections, 1 theorem, 12 equations, 6 figures, 18 tables.

Key Result

Proposition 1

The comparative metric $\mathbb{C}\left(\cdot\right)$ lies in $[-1,1]$ and satisfies the following properties:

Figures (6)

  • Figure 1: Examples from UnQover: We intentionally design them to not have an obvious answer.
  • Figure 2: Examples that illustrate the reasoning errors of positional dependence and attribute independence. $\tau_{2,1}$ is obtained by swapping the subjects in $\tau_{1,2}$. $\bar{a}$ is the negated form of attribute $a$. We use RoBERTa$_{\textrm{B}}$ fine-tuned on SQuAD.
  • Figure 3: Model bias intensity $\mu$. Models are arranged by their sizes for BERT and RoBERTa classes.
  • Figure 4: Average and stddev. of the ranks of $69$ nationalities by $\gamma(x)$ across five SQuAD models. A smaller rank indicates more negative sentiment. We show the top/bottom-8 and trim those that fall in the middle. Note that the ranks are based on our dataset, and are not general statements about the countries.
  • Figure 5: Average and stddev. of ranks of ethnicities (top) and religions (bottom) by $\gamma(x)$ across five SQuAD models. A smaller rank indicates more negative sentiment. Note that the ranks are based on our dataset, and are not a general statement about the groups.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Proposition 1
  • proof