A Group Fairness Lens for Large Language Models

Guanqun Bi, Yuqiang Xie, Lei Shen, Yanan Cao

TL;DR

The paper introduces a group fairness lens for evaluating LLM bias using a hierarchical dimension-target-attribute schema and the GFair dataset drawn from real-world sources. It pairs this with a novel statement organization task to surface subtle biases, and proposes GF-Think, a chain-of-thought–based mitigation approach. Experiments across open-source and commercial LLMs reveal substantial cross-model variability in social bias and group fairness, highlighting safety concerns and the need for robust, dimension-aware debiasing. The work demonstrates significant bias reduction through GF-Think and provides a foundation for broader, multilingual fairness evaluation in large language models.

Abstract

Assessing LLMs for bias and fairness is critical, yet current evaluations are often narrow and lack a broad categorical view. In this paper, we propose evaluating the bias and fairness of LLMs from a group fairness lens using a novel hierarchical schema characterizing diverse social groups. Specifically, we construct a dataset, GFAIR, encapsulating target-attribute combinations across multiple dimensions. Moreover, we introduce statement organization, a new open-ended text generation task, to uncover complex biases in LLMs. Extensive evaluations of popular LLMs reveal inherent safety concerns. To mitigate the biases of LLMs from a group fairness perspective, we pioneer a novel chain-of-thought method, GF-THINK. Experimental results demonstrate its efficacy in mitigating bias and achieving fairness in LLMs. Our dataset and code are available at https://github.com/surika/Group-Fairness-LLMs.
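
As a rough illustration of the dimension-target-attribute hierarchy described above, the sketch below enumerates target-attribute combinations within one dimension and computes the spread of per-target scores. The names `Dimension`, `combinations`, and `target_spread` are hypothetical and not taken from the paper's code. The attribute strings and scores are placeholders, and the use of a population standard deviation is an assumption rather than the authors' stated metric.

```python
from dataclasses import dataclass
from itertools import product
from statistics import pstdev


@dataclass(frozen=True)
class Dimension:
    """One node of a hypothetical dimension -> target -> attribute hierarchy."""
    name: str
    targets: tuple[str, ...]
    attributes: tuple[str, ...]

    def combinations(self) -> list[tuple[str, str]]:
        """Return every (target, attribute) pair within this dimension."""
        return list(product(self.targets, self.attributes))


def target_spread(scores: dict[str, float]) -> float:
    """Population standard deviation of per-target scores.

    A spread of 0 means every target in the dimension received the same
    score, which is the ideal under this illustrative measure.
    """
    return pstdev(list(scores.values()))


# Targets echo the age example in Figure 1; attributes are placeholders.
age = Dimension("age", ("middle-aged", "elderly"), ("attr_a", "attr_b"))
pairs = age.combinations()

# Toy scores for illustration only, not results from the paper.
toy_scores = {"middle-aged": 0.8, "elderly": 0.2}
spread = target_spread(toy_scores)
```

Under this assumed measure, a larger spread means the model treats targets within the same dimension less evenly.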

Paper Structure

This paper contains 33 sections, 4 equations, 8 figures, 14 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Examples that lack group fairness. For the same attribute with only the target altered, the output is toxic towards the target "middle-aged" but safe for the target "elderly". Additionally, when the dimension shifts from age to nationality, the LLM declines to comment.
  • Figure 2: Relation between bias and fairness.
  • Figure 3: An illustration of the statement organization evaluation method.
  • Figure 4: Significance-test results for the GPT-4 model across dimensions. Darker shades indicate lower p-values. Cells with p < 0.05 (black-blue) indicate a statistically significant difference between the compared groups.
  • Figure 5: Standard deviation between targets under each dimension.
  • ...and 3 more figures