Annotation alignment: Comparing LLM and human annotations of conversational safety

Rajiv Movva, Pang Wei Koh, Emma Pierson

TL;DR

GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another. Its correlation with annotators also varies substantially and idiosyncratically within groups, suggesting that race & gender do not fully capture differences in alignment.

Abstract

Do LLMs align with human perceptions of safety? We study this question via annotation alignment, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average ($r = 0.51$). We show that larger datasets are needed to resolve whether LLMs exhibit disparities in how well they correlate with different demographic groups. Also, there is substantial idiosyncratic variation in correlation within groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
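
As a rough illustration of the alignment measurement described in the abstract, the sketch below computes a model's Pearson correlation with the average annotator rating and compares it to the median annotator's correlation with that same average. It runs on synthetic data, not on DICES. The array layout, the function name `annotation_alignment`, and the choice to leave each annotator inside the average they are compared against are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def annotation_alignment(human, model):
    """Correlate model and per-annotator ratings with the mean human rating.

    human: array of shape (n_conversations, n_annotators)
    model: array of shape (n_conversations,)
    Returns (model_r, median_annotator_r).
    """
    human = np.asarray(human, dtype=float)
    model = np.asarray(model, dtype=float)
    mean_rating = human.mean(axis=1)
    model_r = np.corrcoef(model, mean_rating)[0, 1]
    annotator_r = np.array([
        np.corrcoef(human[:, j], mean_rating)[0, 1]
        for j in range(human.shape[1])
    ])
    # nanmedian skips annotators whose ratings never vary (undefined correlation).
    return model_r, np.nanmedian(annotator_r)


# Synthetic stand-in (assumption): 350 conversations, 112 noisy annotators.
rng = np.random.default_rng(0)
latent = rng.normal(size=350)
human = latent[:, None] + rng.normal(scale=2.0, size=(350, 112))
model = latent + rng.normal(scale=0.5, size=350)
model_r, median_r = annotation_alignment(human, model)
print(f"model r = {model_r:.2f}, median annotator r = {median_r:.2f}")
```

In this toy setup the model is less noisy than any single annotator, so it tracks the average more closely than the median annotator does. This mirrors the direction of the reported result but says nothing about its magnitude.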

Paper Structure (21 sections, 3 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Human annotators disagree about what constitutes a safe chatbot (left). We study three questions around whether LLM annotators capture human annotation diversity (right): we measure safety annotation alignment with the average of 112 humans (RQ1) and with different annotator demographic groups (RQ2), and we evaluate whether GPT-4 can predict when one group rates a conversation more unsafe than another (RQ3).
  • Figure 2: GPT-4 does not align significantly more or less with different race-gender groups. Green: True Pearson correlations between GPT-4 ratings and the average ratings for a group. Grey: Null correlation distributions (means and 99% CIs), computed over 5,000 permutations of rater demographics. All green points lie within their null CIs.
  • Figure S1: The prompt we use to generate safety ratings of a user-chatbot conversation. This prompt follows the analyze-rate structure described in Chiang and Lee (2023). We also test a version of this prompt, rating-only, which removes the analyze step. In the appendix, we describe the considerations involved in prompt design.
  • Figure S2: The prompt we use to study if GPT-4 can predict annotation disagreement: the extent to which one group of annotators rates a conversation more unsafe than another group of annotators. GROUP_A and GROUP_B are replaced by annotator race groups in our dataset, such as "white" and "Latinx". The model's Likert scores are compared to the true differences in mean group safety rating, $\mu_{G_B} - \mu_{G_A}$. We observe no statistically significant correlations for any of the tested group pairs.
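
The permutation procedure summarized in the Figure 2 caption can be sketched as follows. This is a minimal illustration on synthetic data, assuming ratings are stored as a conversations-by-annotators array with one demographic label per annotator. The function names, the label encoding, and the use of the 0.5th and 99.5th percentiles of the null distribution as the 99% interval are assumptions, not the authors' implementation.

```python
import numpy as np


def group_correlation(human, model, mask):
    """Pearson r between model ratings and the mean rating of selected annotators."""
    return np.corrcoef(model, human[:, mask].mean(axis=1))[0, 1]


def permutation_null(human, model, groups, target, n_perm=5000, seed=0):
    """Compare the observed group correlation to a label-shuffled null.

    groups: one demographic label per annotator (column of `human`).
    Returns (observed_r, null_mean, (lower, upper)).
    """
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    observed = group_correlation(human, model, groups == target)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling labels keeps group sizes fixed but breaks any link
        # between demographics and ratings.
        shuffled = rng.permutation(groups)
        null[i] = group_correlation(human, model, shuffled == target)
    lower, upper = np.percentile(null, [0.5, 99.5])
    return observed, null.mean(), (lower, upper)


# Synthetic stand-in (assumption): 112 annotators spread over 10 groups.
rng = np.random.default_rng(1)
latent = rng.normal(size=350)
human = latent[:, None] + rng.normal(scale=2.0, size=(350, 112))
model = latent + rng.normal(scale=0.5, size=350)
groups = np.arange(112) % 10
observed, null_mean, (lower, upper) = permutation_null(
    human, model, groups, target=0, n_perm=1000
)
print(f"observed r = {observed:.2f}, null 99% interval = [{lower:.2f}, {upper:.2f}]")
```

An observed correlation inside the interval is consistent with the model aligning no more or less with that group than with a random set of annotators of the same size.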