Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback

Emilia Agis Lerner, Florian E. Dorner, Elliott Ash, Naman Goel

TL;DR

This work investigates how annotator demographics shape fairness preferences in the context of learning from human feedback for content moderation. By combining a prior dataset with a newly collected, highly variable dataset of 1500 sentence pairs labeled by 1000 annotators, it shows that age, education, and political stance significantly influence personal fairness judgments, while most demographics affect beliefs about the average American's fairness view. It demonstrates that downstream models trained on demographic-specific data display diverse performance and that an ensemble across demographic models can improve balanced accuracy for several demographic intersections. The findings highlight ethical and practical implications for leveraging human feedback in AI systems, emphasizing the need to account for demographic diversity and potential representation gaps when aligning AI with human fairness preferences.

Abstract

There is a growing body of work on learning from human feedback to align various aspects of machine learning systems with human values and preferences. We consider the setting of fairness in content moderation, in which human feedback is used to determine how two comments -- referencing different sensitive attribute groups -- should be treated in comparison to one another. With a novel dataset collected from Prolific and MTurk, we find significant gaps in fairness preferences depending on the race, age, political stance, educational level, and LGBTQ+ identity of annotators. We also demonstrate that demographics mentioned in text have a strong influence on how users perceive individual fairness in moderation. Further, we find that differences also exist in downstream classifiers trained to predict human preferences. Finally, we observe that an ensemble that gives equal weight to classifiers trained on annotations from different demographics performs better for different demographic intersections than a single classifier that gives equal weight to each annotation.
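The equal-weight ensemble described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The group names, probabilities, labels, and the 0.5 threshold are hypothetical, and averaging predicted probabilities is only one plausible way to give each demographic-specific classifier equal weight. Balanced accuracy is computed as the mean of per-class recall.

```python
from statistics import mean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return mean(recalls)


def equal_weight_ensemble(group_probs, threshold=0.5):
    """Average predicted probabilities across group models.

    Each group's classifier counts once, no matter how many annotations
    that group contributed (assumed aggregation rule, for illustration).
    """
    n_items = len(next(iter(group_probs.values())))
    avg = [mean(p[i] for p in group_probs.values()) for i in range(n_items)]
    return [int(a >= threshold) for a in avg]


# Hypothetical predicted probabilities of a binary fairness label for four
# sentence pairs, from three classifiers that were each trained on the
# annotations of a single demographic group.
group_probs = {
    "group_a": [0.9, 0.2, 0.7, 0.4],
    "group_b": [0.8, 0.4, 0.3, 0.1],
    "group_c": [0.6, 0.3, 0.8, 0.2],
}
y_true = [1, 0, 1, 0]  # hypothetical reference labels

y_pred = equal_weight_ensemble(group_probs)
print(y_pred, balanced_accuracy(y_true, y_pred))
```

In this toy setup, a group that supplies few annotations still has the same say in the final prediction as a group that supplies many, which is the contrast the abstract draws with a single classifier that weights every annotation equally.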

Paper Structure (33 sections, 7 figures, 19 tables)

Figures (7)

  • Figure 1: Example tasks in our survey, asking people about their fairness preferences and their guess of the average American answer. Each task contains a pair of sentences; sentences in a pair differ along a sensitive attribute such as religion, gender, etc.
  • Figure 2: Distribution of unalikeability coefficients in each focus category of sentence pairs. In the mixed category, sentences in a pair refer to different sensitive attributes, whereas in the other categories (gender, race, religion), sentences in a pair refer to different values (e.g., men, women) of the same sensitive attribute.
  • Figure 3: Distribution of unalikeability coefficients per dataset.
  • Figure 4: Distribution of unalikeability coefficients per label type.
  • Figure 5: Balanced accuracy scores across ten iterations of model $\hat{\phi}_1$, when tested on data with different thresholds of unalikeability coefficients. The darker blue line represents the mean.
  • ...and 2 more figures
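
Several captions above refer to unalikeability coefficients without defining them. As a rough illustration, the sketch below uses one common definition for categorical data, $u = 1 - \sum_i p_i^2$, where $p_i$ is the share of annotations in category $i$. That the paper uses exactly this form is an assumption, and the function name and example labels are hypothetical.

```python
from collections import Counter


def unalikeability(labels):
    """Coefficient of unalikeability, u = 1 - sum_i p_i**2.

    Returns 0.0 when all annotators chose the same label and grows as the
    annotations spread more evenly across labels.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical annotations for two sentence pairs.
print(unalikeability(["fair", "fair", "fair", "fair"]))      # full agreement
print(unalikeability(["fair", "fair", "unfair", "unfair"]))  # even split
```

Under this definition, filtering test data by a threshold on the coefficient, as in Figure 5, would keep only the sentence pairs whose annotations show at most a given level of disagreement.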