CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models

Song Wang; Peng Wang; Tong Zhou; Yushun Dong; Zhen Tan; Jundong Li

TL;DR

CEB is a Compositional Evaluation Benchmark that addresses fragmentation in fairness evaluation for large language models. It structures datasets along a three-axis taxonomy of bias type (stereotyping, toxicity), social group (age, gender, race, religion), and task (direct and indirect), which unifies existing bias datasets and enables systematic exploration of new configurations. Experiments across GPT-3.5/4 and open LLMs show that bias levels vary by configuration and that generation and classification tasks exhibit different difficulty and risk profiles, with GPT-4 often serving as a strong bias evaluator. The benchmark provides a scalable, configurable tool for assessing model fairness and for targeted bias mitigation in real-world deployments.

Abstract

As Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks, concerns regarding the potential negative societal impacts of LLM-generated content have also arisen. To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets. However, existing bias evaluation efforts often focus on only a particular type of bias and employ inconsistent evaluation metrics, leading to difficulties in comparison across different datasets and LLMs. To address these limitations, we collect a variety of datasets designed for the bias evaluation of LLMs, and further propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks. The curation of CEB is based on our newly proposed compositional taxonomy, which characterizes each dataset from three dimensions: bias types, social groups, and tasks. By combining the three dimensions, we develop a comprehensive evaluation strategy for the bias in LLMs. Our experiments demonstrate that the levels of bias vary across these dimensions, thereby providing guidance for the development of specific bias mitigation methods.
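
The idea of combining the three dimensions can be illustrated with a minimal sketch that enumerates every (bias type, social group, task) triple. The bias types and social groups are those named in the TL;DR; the five task names are taken from the section titles in the outline below. The `Configuration` type and `enumerate_configurations` helper are hypothetical names introduced for illustration only, and whether CEB actually populates every combination is an assumption, not something this page confirms.

```python
from itertools import product
from typing import NamedTuple


class Configuration(NamedTuple):
    """One point in the three-dimensional taxonomy (hypothetical helper type)."""
    bias_type: str
    social_group: str
    task: str


# Dimension values as listed on this page; the task names are read off the
# section titles (CEB-Recognition, CEB-Selection, and so on).
BIAS_TYPES = ("stereotyping", "toxicity")
SOCIAL_GROUPS = ("age", "gender", "race", "religion")
TASKS = ("recognition", "selection", "continuation", "conversation", "classification")


def enumerate_configurations(bias_types=BIAS_TYPES,
                             social_groups=SOCIAL_GROUPS,
                             tasks=TASKS):
    """Return the Cartesian product of the three dimensions.

    Assumption: every combination is treated as a valid configuration;
    the real benchmark may cover only a subset.
    """
    return [Configuration(b, g, t)
            for b, g, t in product(bias_types, social_groups, tasks)]


configs = enumerate_configurations()
print(len(configs))   # 2 * 4 * 5 = 40 candidate configurations
print(configs[0])
```

Enumerating the product this way makes it easy to see which configurations an existing dataset covers and which remain unexplored, for example by filtering `configs` on a single `bias_type` or `task`.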

Paper Structure (55 sections, 33 figures, 15 tables)

Introduction
Related Work
Compositional Taxonomy in CEB
Bias Type
Social Group
Task
Configurations of Existing Datasets
CEB Dataset Construction
CEB-Recognition and CEB-Selection
CEB-Continuation and CEB-Conversation
CEB Datasets for Classification (CEB-Adult, CEB-Credit, and CEB-Jigsaw)
Experimental Setup
Evaluation Metrics
Models
Experimental Results
...and 40 more sections

Figures (33)

Figure 1: An example from the bias evaluation dataset BBQ (Parrish et al., 2022).
Figure 2: The two drawbacks of existing datasets.
Figure 3: Left: Our compositional taxonomy of datasets, characterizing three key components: bias types, social groups, and tasks. Center: The exemplar prompts as LLM input for different tasks of the Stereotyping bias type. Right: Evaluation metrics for tasks.
Figure 4: The detailed dataset construction process of five tasks with the bias type of Stereotyping based on our compositional taxonomy. The process of the Toxicity bias type is similar, except that "stereotypical" is replaced with "toxic".
Figure 5: Visualizations of the results for Stereotyping across various LLMs. We omit the results for Llama2-7b on the Continuation and Conversation tasks because of its high RtA (Refuse to Answer) rates.
...and 28 more figures
