CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias

Vipul Gupta, Pranav Narayanan Venkit, Hugo Laurençon, Shomir Wilson, Rebecca J. Passonneau

TL;DR

CALM introduces a robust, multi-task bias benchmark for language models by integrating 16 diverse datasets across question answering (QA), sentiment analysis (SA), and natural language inference (NLI) to form 224 templates and 78,400 prompts focused on gender and race bias. It defines a bias score as $bs = \frac{\frac{\text{\#correct}_{sg}}{50} - \text{baseline}}{\text{baseline}} \times 100$ and aggregates per-task and overall biases to compare social groups across models, while demonstrating robustness to template perturbations and to random selection of template subsets. Empirically, CALM evaluates 20 large language models and finds that, for 2 model series, larger models tend to be more biased than smaller ones, while the T0 series is the least biased family, underscoring complex interactions between model scale, training, and sociodemographic bias. The work provides an extensible, benchmark-based approach for reliable cross-model bias assessment and mitigation, accompanied by an ethics-informed discussion of compute, environmental impact, and potential adverse effects of bias benchmarking results.
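The bias score above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name and parameter names are hypothetical, `baseline` is treated as a caller-supplied reference accuracy because the summary does not define it, and the denominator of 50 is assumed to be the number of names per social group.

```python
def bias_score(num_correct_sg: int, baseline: float, names_per_group: int = 50) -> float:
    """Relative deviation (in percent) of a social group's accuracy from a baseline.

    num_correct_sg  -- number of correct answers for one social group (#correct_sg)
    baseline        -- reference accuracy; ASSUMPTION: supplied by the caller,
                       since the summary does not say how it is computed
    names_per_group -- ASSUMPTION: the 50 in the formula is the names-per-group count
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    group_accuracy = num_correct_sg / names_per_group
    return (group_accuracy - baseline) / baseline * 100


# Illustrative numbers only, not results from the paper.
score_above = bias_score(num_correct_sg=45, baseline=0.8)  # group accuracy 0.9
score_equal = bias_score(num_correct_sg=40, baseline=0.8)  # group accuracy 0.8
```

Under this reading, a positive score means the group is answered more accurately than the baseline, a negative score less accurately, and zero means no deviation.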

Abstract

As language models (LMs) become increasingly powerful and widely used, it is important to quantify their sociodemographic bias and its potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or a limited number of templates. Also, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of two types of universally relevant sociodemographic bias, gender and race. CALM integrates sixteen datasets for question-answering, sentiment analysis and natural language inference. Examples from each dataset are filtered to produce 224 templates with high diversity (e.g., length, vocabulary). We assemble 50 highly frequent person names for each of seven distinct demographic groups to generate 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random subset selection of templates. We apply CALM to 20 large language models, and find that for 2 language model series, larger parameter models tend to be more biased than smaller ones. The T0 series is the least biased model family among the 20 LLMs investigated here. The code is available at https://github.com/vipulgupta1011/CALM.
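The prompt count follows from the numbers in the abstract: 224 templates × 7 groups × 50 names = 78,400 prompts. The sketch below shows one way such prompts could be produced by filling a name placeholder in each template. The placeholder token, function name, group labels, and toy names are all hypothetical and are not taken from the CALM code base.

```python
PLACEHOLDER = "<PERSON>"  # hypothetical placeholder token, not CALM's actual marker


def generate_prompts(templates, names_by_group):
    """Fill every template with every name of every group."""
    prompts = []
    for template in templates:
        for group, names in names_by_group.items():
            for name in names:
                prompts.append(
                    {
                        "group": group,
                        "name": name,
                        "prompt": template.replace(PLACEHOLDER, name),
                    }
                )
    return prompts


# Toy data for illustration only.
toy_templates = ["<PERSON> went to the store. Who went to the store?"]
toy_names = {"group_a": ["Alex", "Sam"], "group_b": ["Kim", "Lee"]}
toy_prompts = generate_prompts(toy_templates, toy_names)

# Sizes reported in the abstract: templates x groups x names per group.
expected_calm_prompts = 224 * 7 * 50
```

With the toy inputs this yields 1 × 2 × 2 = 4 prompts; with the sizes reported in the abstract the same product gives 78,400.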

Paper Structure (34 sections, 1 equation, 3 figures, 11 tables)

Figures (3)

  • Figure 1: CALM templates were created from examples drawn from existing datasets by replacing names or personal pronouns with placeholders.
  • Figure 2: Issues of prior datasets addressed in CALM.
  • Figure 3: This graph illustrates bias in Llama-2 and OPT models. Bias decreases with increasing size in Llama-2, but follows no consistent pattern for OPT, where it increases from 2.7B to 30B.