CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias

Vipul Gupta; Pranav Narayanan Venkit; Hugo Laurençon; Shomir Wilson; Rebecca J. Passonneau

TL;DR

CALM introduces a robust, multi-task bias benchmark for language models by integrating 16 diverse datasets across question answering (QA), sentiment analysis (SA), and natural language inference (NLI) to form 224 templates and 78,400 prompts focused on gender and race bias. It defines a bias score as $bs = \frac{\frac{\text{\#correct}_{sg}}{50} - \text{baseline}}{\text{baseline}} \times 100$, where $\text{\#correct}_{sg}$ is the number of correct predictions for social group $sg$ over its 50 names, and aggregates per-task and overall biases to compare social groups across models, while demonstrating robustness to template perturbations and prompt subset changes. Empirically, CALM evaluates 20 LLMs and finds that larger parameter models can be more biased in some cases, though the T0 series tends to be less biased, underscoring complex interactions between model scale, training, and sociodemographic bias. The work provides an extensible, benchmark-based approach for reliable cross-model bias assessment and mitigation, accompanied by an ethics-informed discussion on compute, environmental impact, and potential adverse effects of bias benchmarking results.
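As a rough illustration of the bias score formula above, the following Python sketch computes the score for one social group. The function name `bias_score` and its parameters are hypothetical, and the baseline is treated as a value supplied by the caller because this summary does not define how it is obtained.

```python
def bias_score(num_correct: int, baseline: float, names_per_group: int = 50) -> float:
    """Relative deviation (in percent) of a group's accuracy from a baseline.

    num_correct     -- correct predictions for one social group (#correct_sg)
    baseline        -- reference accuracy in (0, 1]; how it is derived is an
                       assumption left to the caller, not taken from the paper
    names_per_group -- number of names per group (50 in the formula above)
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    group_accuracy = num_correct / names_per_group
    return (group_accuracy - baseline) / baseline * 100


# Illustrative numbers only: 45 of 50 correct against a baseline accuracy of 0.8.
example_score = bias_score(45, 0.8)
print(f"bias score: {example_score:.2f}")
```

A positive score means the group's accuracy is above the baseline, a negative score means it is below, and a score of zero means the two coincide.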

Abstract

As language models (LMs) become increasingly powerful and widely used, it is important to quantify their sociodemographic bias, which has potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or a limited number of templates. Also, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of two types of universally relevant sociodemographic bias, gender and race. CALM integrates sixteen datasets for question answering, sentiment analysis, and natural language inference. Examples from each dataset are filtered to produce 224 templates with high diversity (e.g., length, vocabulary). We assemble 50 highly frequent person names for each of seven distinct demographic groups to generate 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random subset selection of templates. We apply CALM to 20 large language models, and find that for two language model series, larger parameter models tend to be more biased than smaller ones. The T0 series is the least biased model family of the 20 LLMs investigated here. The code is available at https://github.com/vipulgupta1011/CALM.
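The prompt count in the abstract follows from filling every template with every name: 224 templates × 7 groups × 50 names = 78,400 prompts. The sketch below shows one way such an expansion could look. The placeholder token `<PERSON>`, the toy template, the toy names, and the function `expand_templates` are all hypothetical and are not taken from the CALM code base.

```python
def expand_templates(templates, names_by_group, placeholder="<PERSON>"):
    """Fill each template with every name of every group.

    Returns a list of dicts recording the group, the name, and the filled prompt.
    """
    prompts = []
    for template in templates:
        for group, names in names_by_group.items():
            for name in names:
                prompts.append({
                    "group": group,
                    "name": name,
                    "prompt": template.replace(placeholder, name),
                })
    return prompts


# Toy data for illustration only (not from the benchmark).
toy_templates = ["<PERSON> went to the store. Who went to the store?"]
toy_names = {"group_a": ["Alex", "Sam"], "group_b": ["Jo", "Lee"]}
toy_prompts = expand_templates(toy_templates, toy_names)

# Counts stated in the abstract: 224 templates, 7 groups, 50 names per group.
NUM_TEMPLATES, NUM_GROUPS, NAMES_PER_GROUP = 224, 7, 50
total_prompts = NUM_TEMPLATES * NUM_GROUPS * NAMES_PER_GROUP
print(len(toy_prompts), total_prompts)
```

Keeping the group label next to each prompt makes it straightforward to count correct predictions per group afterwards.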

Paper Structure (34 sections, 1 equation, 3 figures, 11 tables)

Introduction
Related Work
CALM Data and Score
Tasks
Template Creation
Bias Categories
Bias Score
Evaluation of CALM
Assessing CALM's Robustness: A Sensitivity Analysis
Prompt Subset Selection
Comparative Analysis with Other Bias Datasets : A Diversity Analysis
Qualitative Observations
Models Evaluated
Results
Template Error Analysis
...and 19 more sections

Figures (3)

Figure 1: CALM templates were created from examples drawn from existing datasets by replacing names or personal pronouns with placeholders.
Figure 2: Issues of prior datasets addressed in CALM.
Figure 3: Bias in Llama-2 and OPT models. Bias decreases with increasing size in Llama-2, but follows a random pattern for OPT, increasing from 2.7B to 30B.
