Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of Offensiveness

Shayan Alipour, Indira Sen, Mattia Samory, Tanushree Mitra

TL;DR

Alignment increases with higher annotator sensitivity and group agreement, while greater document difficulty corresponds to reduced alignment, highlighting the importance of multi-dataset analyses and confounder-aware methodologies in developing robust measures of demographic bias in LLMs.

Abstract

Large language models (LLMs) are known to exhibit demographic biases, yet few studies systematically evaluate these biases across multiple datasets or account for confounding factors. In this work, we examine LLM alignment with human annotations in five offensive language datasets, comprising approximately 220K annotations. Our findings reveal that while demographic traits, particularly race, influence alignment, these effects are inconsistent across datasets and often entangled with other factors. Confounders -- such as document difficulty, annotator sensitivity, and within-group agreement -- account for more variation in alignment patterns than demographic traits alone. Specifically, alignment increases with higher annotator sensitivity and group agreement, while greater document difficulty corresponds to reduced alignment. Our results underscore the importance of multi-dataset analyses and confounder-aware methodologies in developing robust measures of demographic bias in LLMs.

Paper Structure

This paper contains 26 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison of model correlations with human annotators against human agreement (individual annotators with their peers), highlighting how well models align with human judgment.
  • Figure 2: Pearson correlation coefficients between model outputs and human annotator labels, broken down by gender (a) and ethnicity (b) across five datasets. The ground truth for each post is determined by averaging the labels from annotators belonging to the target demographic. Darker shades indicate stronger correlations. Confidence intervals and p-values for statistical significance are reported in Table \ref{tab:average-demo-model-corr} in the Appendix.
  • Figure 3: The 95% confidence intervals (CI) for the difference in correlation between the model's predictions and two demographic groups, computed as $\Delta r = r(P, D_1) - r(P, D_2)$, where $P$ represents the model's predictions, and $D_1$ and $D_2$ are two demographic groups. Ground truth for each post is determined by averaging the labels from annotators in the target demographic. The intervals are derived from 1,000 bootstrap samples. If the CI includes zero, the difference is not statistically significant. See Table \ref{tab:average-corr-diff} in the Appendix for further details.
  • Figure 4: Comparison of model correlations with human annotators against human agreement (individual annotators with their peers), highlighting how well models align with human judgment. The ground truth for each post is determined by the majority vote of annotators' labels. For human agreement, correlations are measured by leaving out one annotator and comparing their labels to the ground truth from the remaining annotators. Error bars represent 95% confidence intervals.
  • Figure 5: Pearson correlation coefficients between model outputs and human annotator labels, broken down by gender (a) and ethnicity (b) across five datasets. The ground truth for each post is determined by the majority vote of annotators from the target demographic. Darker shades indicate stronger correlations. Confidence intervals and p-values for statistical significance are reported in Table \ref{tab:majority-demo-model-corr}.
  • ...and 3 more figures
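
The bootstrap interval for $\Delta r = r(P, D_1) - r(P, D_2)$ described in the Figure 3 caption can be illustrated with a minimal sketch. The caption states only that the intervals come from 1,000 bootstrap samples; resampling posts with replacement and taking percentile bounds are assumptions made here, and the function names `pearson_r` and `bootstrap_delta_r` are hypothetical. Inputs are assumed to be per-post values: the model's predictions and the per-post average label from each demographic group.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_delta_r(pred, d1, d2, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for r(pred, d1) - r(pred, d2).

    Assumption: posts are the resampling unit, and the same resampled
    posts are used for both correlations so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(pred)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample posts with replacement
        p = [pred[i] for i in idx]
        a = [d1[i] for i in idx]
        b = [d2[i] for i in idx]
        deltas.append(pearson_r(p, a) - pearson_r(p, b))
    deltas.sort()
    # Simple percentile bounds, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, max(0, int((1 - alpha / 2) * n_boot) - 1))
    point = pearson_r(pred, d1) - pearson_r(pred, d2)
    return point, deltas[lo_idx], deltas[hi_idx]
```

Under this sketch, an interval that excludes zero corresponds to the caption's criterion for a statistically significant difference between the two groups. Resamples in which any series is constant would raise a division error, so real label data with few distinct values may need a guard.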