BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models

Alok Abhishek, Lisa Erickson, Tushar Bandopadhyay

TL;DR

BEATS introduces a formal, scalable framework for Bias, Ethics, Fairness, and Factuality evaluation in Large Language Models. It combines a 901-question BEATS benchmark with 29 BEFF metrics and a consortium of LLMs as judges to enable robust, statistically grounded comparisons across leading models. BEATS defines a modular metric architecture with BEATS(R) = {BIAS(R), FAIRNESS(R), ETHICS(R), FACTUALITY(R)}, decomposed into submetrics such as BP(R), BC(R), BM(R), DP(R), EO(R), GA(R), EA(R), VA(R), FA(R), MI(R), and others, all expressed on standardized scales. Empirical results show substantial bias presence (37.65% of outputs) and widespread yet variable ethical and fairness performance, underscoring the need for targeted mitigation, governance, and continued evaluation to support socially responsible AI deployment.
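The modular structure BEATS(R) = {BIAS(R), FAIRNESS(R), ETHICS(R), FACTUALITY(R)} can be pictured as a small container with one group of submetrics per dimension. The sketch below is illustrative only: the class name `BeatsScores`, the `flatten` helper, and the numeric values are hypothetical. Placing BP under bias and MI under factuality is an assumption made for the example, not a mapping taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BeatsScores:
    """Hypothetical container for the four BEATS(R) dimensions of one response R.

    Each dimension maps a submetric abbreviation (e.g. "BP") to a score.
    """
    bias: Dict[str, float] = field(default_factory=dict)
    fairness: Dict[str, float] = field(default_factory=dict)
    ethics: Dict[str, float] = field(default_factory=dict)
    factuality: Dict[str, float] = field(default_factory=dict)

    def flatten(self) -> Dict[str, float]:
        """Return all submetrics in one dict keyed as 'DIMENSION.SUBMETRIC'."""
        groups = {
            "BIAS": self.bias,
            "FAIRNESS": self.fairness,
            "ETHICS": self.ethics,
            "FACTUALITY": self.factuality,
        }
        return {
            f"{dimension}.{name}": score
            for dimension, submetrics in groups.items()
            for name, score in submetrics.items()
        }


# Illustrative values only; the grouping of BP and MI is assumed.
scores = BeatsScores(bias={"BP": 1.0}, factuality={"MI": 0.2})
print(scores.flatten())
```

Keeping each dimension as its own mapping lets a submetric be added or dropped without touching the others, which is one way to read the "modular" claim above.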

Abstract

In this research, we introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs). Building upon the BEATS framework, we present a bias benchmark for LLMs that measures performance across 29 distinct metrics. These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality-related misinformation risk. Together, they enable a quantitative assessment of the extent to which LLM-generated responses may perpetuate societal prejudices that reinforce or expand systemic inequities. To achieve a high score on this benchmark, an LLM must show highly equitable behavior in its responses, making the benchmark a rigorous standard for responsible AI evaluation. Empirical results from our experiment show that 37.65% of outputs generated by industry-leading models contained some form of bias, highlighting a substantial risk in using these models in critical decision-making systems. The BEATS framework and benchmark offer a scalable and statistically rigorous methodology to benchmark LLMs, diagnose the factors driving biases, and develop mitigation strategies. With the BEATS framework, our goal is to support the development of more socially responsible and ethically aligned AI models.
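A headline figure such as "37.65% of outputs contained some form of bias" is a proportion over judged responses. The sketch below shows one generic way to compute such a proportion and attach a Wilson score interval to it. It is not the paper's procedure: the function names, the boolean per-response flags, and the choice of the Wilson interval are all assumptions made for illustration.

```python
import math
from typing import Iterable, Tuple


def bias_presence_rate(flags: Iterable[bool]) -> float:
    """Fraction of responses flagged as containing bias (flags are hypothetical)."""
    flags = list(flags)
    if not flags:
        raise ValueError("at least one response is required")
    return sum(1 for flagged in flags if flagged) / len(flags)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Made-up flags for four responses, not data from the paper.
example_flags = [True, False, False, True]
rate = bias_presence_rate(example_flags)
low, high = wilson_interval(sum(example_flags), len(example_flags))
print(rate, low, high)
```

With only a handful of responses the interval is wide. It narrows as the number of judged responses grows, which is why a large question set matters when comparing models.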

Paper Structure

This paper contains 44 sections, 13 equations, 27 figures, 10 tables.

Figures (27)

  • Figure 1: System design of the BEATS evaluation framework, the proposed framework for bias assessment in LLMs. BEATS evaluates a diverse set of LLMs on a selected bias detection dataset, then employs a consortium of LLM-as-a-Judge models to quantify a curated set of metrics related to bias, fairness, ethics, and factuality.
  • Figure 2: Total cumulative bias presence scores across large language model families, as evaluated by the BEATS framework. These results highlight a significant presence of bias in responses across leading models and underscore the need for bias mitigation strategies in GenAI language models.
  • Figure 3: Category-wise bias presence across five leading Large Language Models, as evaluated by the BEATS framework. Each bar represents the total occurrence of a specific bias category. The results highlight the complex, heterogeneous bias profiles of LLMs and underscore the importance of handling a diverse set of intersectional biases in GenAI models.
  • Figure 4: Hexbin density plot showing the joint distribution of Bias Severity Score and Bias Impact Score for responses from all models, as evaluated by the BEATS framework using Claude-3.5 Sonnet as the judge. The highest density is concentrated at the lowest severity and impact scores, indicating that most responses exhibit minimal bias magnitude. However, a significant number of moderate-to-high severity and impact clusters suggests that responses with non-trivial ethical or societal implications are prevalent. The distribution underscores the importance of diagnosing and mitigating high-risk model responses.
  • Figure 5: Box-and-whisker plot of the distribution of ethics-related BEATS evaluation metrics across LLM-generated responses. While the median scores are high across all four metrics, indicating strong ethical alignment in most cases, the wide interquartile ranges and the low outliers indicate that ethical lapses remain prevalent. These findings underscore the importance of improving models to achieve more consistent, higher ethical standards.
  • ...and 22 more figures