Annotation alignment: Comparing LLM and human annotations of conversational safety

Rajiv Movva; Pang Wei Koh; Emma Pierson

TL;DR

GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another. Correlation with GPT-4 also varies substantially and idiosyncratically within groups, suggesting that race and gender do not fully capture differences in alignment.

Abstract

Do LLMs align with human perceptions of safety? We study this question via annotation alignment, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average ($r = 0.51$). We show that larger datasets are needed to resolve whether LLMs exhibit disparities in how well they correlate with different demographic groups. We also find substantial idiosyncratic variation in correlation within groups, suggesting that race and gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
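
As an illustration of the headline comparison in the abstract, the following minimal sketch computes an LLM's Pearson correlation with the average annotator rating and the median annotator's correlation with that same average. It is not the authors' code: the function names, the numeric coding of ratings, the synthetic data, and the choice to include each annotator in the average they are compared against are all assumptions made for the example.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation; returns nan if either input is constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc @ xc) * (yc @ yc))
    return float((xc @ yc) / denom) if denom > 0 else float("nan")


def alignment_summary(human, llm):
    """Return (LLM r, median annotator r) against the average rating.

    human: array of shape (n_annotators, n_conversations), numeric ratings.
    llm:   array of shape (n_conversations,), numeric LLM ratings.
    Assumption: each annotator is included in the average they are
    correlated with; the paper may handle this differently.
    """
    human = np.asarray(human, dtype=float)
    mean_rating = human.mean(axis=0)
    llm_r = pearson(llm, mean_rating)
    annotator_r = np.array([pearson(row, mean_rating) for row in human])
    # Annotators with constant ratings have undefined r and are skipped.
    return llm_r, float(np.nanmedian(annotator_r))


# Synthetic stand-in with the dataset's dimensions (112 annotators,
# 350 conversations); these are NOT the DICES ratings.
rng = np.random.default_rng(0)
latent = rng.normal(size=350)
human = latent + rng.normal(scale=1.5, size=(112, 350))
llm = latent + rng.normal(scale=1.0, size=350)

llm_r, median_r = alignment_summary(human, llm)
print(f"LLM r = {llm_r:.2f}, median annotator r = {median_r:.2f}")
```

An LLM can exceed the median annotator on this metric simply by being less noisy than a typical individual, which is why the comparison is made against the average rating rather than against any single annotator.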

Paper Structure (21 sections, 3 equations, 4 figures, 4 tables)

Introduction
Data and Models
Results
RQ1: GPT-4 and Llama 3.1 surpass the median annotator in terms of correlation with the average annotator rating.
RQ2: The dataset is underpowered to detect demographic differences in annotator-LLM alignment.
RQ3: GPT-4 cannot predict demographic disagreements.
Discussion and Related Work
Conclusion
Data description
Demographics.
Annotator quality.
Models
Prompts and reliability checks
A single, joint safety rating instead of separate, per-criterion ratings.
Likert rating instead of binary rating.
...and 6 more sections

Figures (4)

Figure 1: Human annotators disagree about what constitutes a safe chatbot (left). We study three questions around whether LLM annotators capture human annotation diversity (right): we measure safety annotation alignment with the average of 112 humans (RQ1) and with different annotator demographic groups (RQ2), and we evaluate whether GPT-4 can predict when one group rates a conversation more unsafe than another (RQ3).
Figure 2: GPT-4 does not align significantly more or less with different race-gender groups. Green: True Pearson correlations between GPT-4 ratings and the average ratings for a group. Grey: Null correlation distributions (means and 99% CIs), computed over 5,000 permutations of rater demographics. All green points lie within their null CIs.
Figure S1: The prompt we use to generate safety ratings of a user-chatbot conversation. This prompt follows the analyze-rate structure described in Chiang and Lee (2023). We also test a rating-only version of this prompt, which removes the analyze step. In the appendix, we describe the considerations involved in prompt design.
Figure S2: The prompt we use to study if GPT-4 can predict annotation disagreement: the extent to which one group of annotators rates a conversation more unsafe than another group of annotators. GROUP_A and GROUP_B are replaced by annotator race groups in our dataset, such as "white" and "Latinx". The model's Likert scores are compared to the true differences in mean group safety rating, $\mu_{G_B} - \mu_{G_A}$. We observe no statistically significant correlations for any of the tested group pairs.
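
The permutation null described in the Figure 2 caption can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names, the synthetic data, and the use of a percentile interval for the 99% CI are assumptions. The idea is to shuffle annotators' demographic labels, recompute the correlation between the LLM's ratings and the average rating of the (now random) group, and compare the true correlation against the resulting distribution.

```python
import numpy as np


def group_corr(human, groups, llm, group):
    """Pearson r between LLM ratings and one group's mean rating."""
    mean_g = human[groups == group].mean(axis=0)
    return float(np.corrcoef(llm, mean_g)[0, 1])


def permutation_null(human, groups, llm, group, n_perm=5000, seed=0):
    """Null distribution of group_corr under shuffled group labels.

    human:  (n_annotators, n_conversations) numeric ratings.
    groups: (n_annotators,) array of group labels.
    Shuffling labels preserves group sizes but breaks any link
    between demographics and ratings.
    """
    rng = np.random.default_rng(seed)
    observed = group_corr(human, groups, llm, group)
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(groups)
        null[i] = group_corr(human, shuffled, llm, group)
    # Assumption: the 99% CI is taken as a percentile interval.
    lo, hi = np.percentile(null, [0.5, 99.5])
    return {"observed": observed, "null": null,
            "null_mean": float(null.mean()), "ci": (float(lo), float(hi))}


# Synthetic stand-in data (NOT the DICES ratings); group labels are
# hypothetical placeholders.
rng = np.random.default_rng(0)
latent = rng.normal(size=350)
human = latent + rng.normal(scale=1.5, size=(112, 350))
llm = latent + rng.normal(scale=1.0, size=350)
groups = np.array(["group_a"] * 56 + ["group_b"] * 56)

result = permutation_null(human, groups, llm, "group_a")
lo, hi = result["ci"]
print(f"observed r = {result['observed']:.3f}, null 99% CI = [{lo:.3f}, {hi:.3f}]")
```

An observed correlation inside the null interval, as reported for every group in Figure 2, means the data cannot distinguish that group's alignment from that of a random set of annotators of the same size.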
