Undesirable Biases in NLP: Addressing Challenges of Measurement

Oskar van der Wal; Dominik Bachmann; Alina Leidinger; Leendert van Maanen; Willem Zuidema; Katrin Schulz

TL;DR

The paper addresses the challenge of measuring undesirable biases in NLP by importing psychometrics concepts, particularly construct validity and reliability, to create a principled framework for bias measurement. It surveys existing NLP bias measures, discusses the translation from human psychology to NLP contexts, and outlines concrete reliability and validity assessments tailored to bias benchmarks and prompting. The authors propose a practical design cycle (Preparation, Development, Post-Development) to improve measurement quality, transparency, and cross-context comparability, with attention to downstream harms and sociotechnical contexts. They acknowledge limitations (e.g., multilingual transfer, model dynamics) and advocate for interdisciplinary collaboration to advance robust, responsible bias measurement tools in NLP.

Abstract

As Large Language Models and Natural Language Processing (NLP) technology rapidly develop and spread into daily life, it becomes crucial to anticipate how their use could harm people. One problem that has received a lot of attention in recent years is that this technology has displayed harmful biases, from generating derogatory stereotypes to producing disparate outcomes for different social groups. Although a lot of effort has been invested in assessing and mitigating these biases, our methods of measuring the biases of NLP models have serious problems and it is often unclear what they actually measure. In this paper, we provide an interdisciplinary approach to discussing the issue of NLP model bias by adopting the lens of psychometrics -- a field specialized in the measurement of concepts like bias that are not directly observable. In particular, we will explore two central notions from psychometrics, the construct validity and the reliability of measurement tools, and discuss how they can be applied in the context of measuring model bias. Our goal is to provide NLP practitioners with methodological tools for designing better bias measures, and to inspire them more generally to explore tools from psychometrics when working on bias measurement tools.

Paper Structure (23 sections, 4 figures, 4 tables)

Introduction
Measuring Bias as an Unobservable Concept
A Brief Introduction to Bias Measures in NLP
Why Measure Bias in NLP Models
The Translation Step: From Psychometrics to NLP
Differences Between Model Bias as a Construct and its Operationalizations
Construct Validity and Reliability
Assessing the Reliability of Bias Measures
Inter-Rater Reliability
Internal Consistency
Parallel-Form Reliability
Test-Retest Reliability
Seed-Based Test-Retest Reliability
Corpus-Based Test-Retest Reliability
Time-Based Test-Retest Reliability
...and 8 more sections

Figures (4)

Figure 1: We assume that a training dataset's bias influences the bias of a model trained on that data (although other sources of bias are possible; e.g., model compression may amplify existing biases). Training dataset bias and model bias are unobservable constructs (circles), each with several possible operationalizations (squares).
Figure 2: This figure illustrates the difference between convergent and divergent validity (see the subsection on divergent validity). In this example, convergent validity is assessed by testing how strongly a gender bias measure is related to another gender bias measure. Divergent validity, in contrast, is assessed by testing that the gender bias measure is not strongly correlated with a measure of a different but easily confounded construct (e.g., grammatical gender).
Figure 3: In the subsection on convergent validity, we discuss two ways to validate bias measures through related concepts: (1) testing whether the found bias reflects pre-existing stereotypes in society (e.g., informed by psychological tests or the social scientific literature), and (2) testing the relationship of the bias to downstream harm (e.g., representational and allocative harms). One's theoretical assumptions about the "model bias" construct inform the strengths of the relationships that one expects to find with these related concepts.
Figure 4: Content validity: in this example, male-as-norm bias and gender stereotypes are hypothesized to be separate subconcepts of the construct "model gender bias". The male-as-norm bias reflects the idea that the male gender is assumed as the default unless explicitly indicated otherwise; this could also be reflected by a high prior for male pronouns. Gender stereotypes can refer to a broad category of phenomena in which certain genders are associated with social norms, roles, or attributes and traits (e.g., women are seen as more passive).
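
The convergent/divergent check described in the Figure 2 caption can be sketched as a pair of correlations. This is a minimal illustration, not the authors' procedure: the helper `pearson_r`, the choice of Pearson correlation, and all score values are assumptions made here for demonstration. Each list holds one hypothetical score per model, for the same five models.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five models (illustrative values, not real results).
gender_bias_a = [0.10, 0.25, 0.40, 0.55, 0.70]       # gender bias measure A
gender_bias_b = [0.12, 0.20, 0.45, 0.50, 0.75]       # gender bias measure B
grammatical_gender = [0.30, 0.10, 0.35, 0.15, 0.30]  # confounded construct

# Convergent validity: two measures of the same construct should correlate.
convergent_r = pearson_r(gender_bias_a, gender_bias_b)

# Divergent validity: the bias measure should not track the other construct.
divergent_r = pearson_r(gender_bias_a, grammatical_gender)

print(f"convergent r = {convergent_r:.2f}")
print(f"divergent  r = {divergent_r:.2f}")
```

With these made-up scores, the two gender bias measures correlate strongly while the correlation with the grammatical gender measure stays near zero, which is the pattern one would hope to see. What counts as "strong" or "weak" depends on one's theoretical assumptions about the construct, and five data points are far too few for a real assessment.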
