Designing Psychometric Bias Measures for ChatBots: An Application to Racial Bias Measurement

Mouhacine Benosman

TL;DR

The paper tackles the measurement of racial bias in large language models by proposing STAMP-LLM, a two-phase psychometric framework for designing bias measures for chatbots. It defines inference statistics, details a Definitional Phase for construct definition and item development, and a Data/Analysis Phase for standardized data collection and validation. In a racial-bias case study, the author adapts a human racism scale and two implicit tests, observing high test-retest reliability but weak convergent validity (Spearman correlation $<0.25$), which underscores the need for AI-tailored validation. The paper advocates adopting STAMP-LLM as a standardized workflow to enable reproducible, cross-model bias assessment and safer deployment of LLMs.

Abstract

Artificial intelligence (AI), particularly in the form of large language models (LLMs) or chatbots, has become increasingly integrated into our daily lives. In the past five years, several LLMs have been introduced, including ChatGPT by OpenAI, Claude by Anthropic, and Llama by Meta, among others. These models have the potential to be employed across a wide range of human-machine interaction applications, such as chatbots for information retrieval, assistance in corporate hiring decisions, college admissions, financial loan approvals, parole determinations, and even medical fields such as psychotherapy delivered through chatbots. The key question is whether these chatbots will interact with humans in a bias-free manner or whether they will further reinforce the pathological biases already present in human-to-human interactions. If the latter is true, how can we rigorously measure these biases? We address this challenge by introducing STAMP-LLM (Standardized Test and Assessment Measurement Protocol for LLMs), a principled, two-phase psychometric framework for designing measures of chatbot bias: (i) a Definitional phase for construct mapping, item development, and expert review; and (ii) a Data/Analysis phase for protocol control (prompts/decoding), automated sampling, pre-specified scoring, and basic reliability/validity checks. We illustrate STAMP-LLM on racial bias using one explicit and two implicit measures.
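
The reliability and validity checks mentioned above rest on rank correlations between score vectors. As a minimal sketch, assuming each measure yields one numeric score per sampled chatbot session, the Spearman coefficient can be computed as the Pearson correlation of average ranks. The function names `ranks`, `pearson`, and `spearman` and the score lists are hypothetical illustrations, not the paper's code or data, and the summary does not state which statistic the paper uses for test-retest reliability.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores for illustration only (not results from the paper).
explicit_run_1 = [2.1, 3.4, 1.8, 4.0, 2.9]  # explicit scale, first run
explicit_run_2 = [2.0, 3.6, 1.9, 3.8, 3.0]  # same scale, repeated run
implicit_score = [0.3, 0.1, 0.4, 0.2, 0.5]  # one implicit measure

test_retest = spearman(explicit_run_1, explicit_run_2)
convergent = spearman(explicit_run_1, implicit_score)
print(f"test-retest: {test_retest:.2f}, convergent: {convergent:.2f}")
```

Under this reading, a high correlation between repeated runs of the same measure indicates test-retest reliability, while a low correlation between the explicit and implicit measures corresponds to the weak convergent validity reported in the TL;DR.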

Paper Structure

This paper contains 8 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Sample of validity tests