Table of Contents
Fetching ...

Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups

Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Zoe Ashwood, Aida Mostafazadeh Davani, Mark Diaz, Michela Paganini, Alicia Parrish, Ding Wang, Vinodkumar Prabhakaran, Lora Aroyo, Verena Rieser

TL;DR

A large-scale study employing highly-parallel safety ratings of about 1000 text-to-image (T2I) generations from a demographically diverse rater pool of 630 raters balanced across 30 intersectional groups across age, gender, and ethnicity shows significant differences across demographic groups on how severe they assess the harm to be.

Abstract

AI systems crucially rely on human ratings, but these ratings are often aggregated, obscuring the inherent diversity of perspectives in real-world phenomenon. This is particularly concerning when evaluating the safety of generative AI, where perceptions and associated harms can vary significantly across socio-cultural contexts. While recent research has studied the impact of demographic differences on annotating text, there is limited understanding of how these subjective variations affect multimodal safety in generative AI. To address this, we conduct a large-scale study employing highly-parallel safety ratings of about 1000 text-to-image (T2I) generations from a demographically diverse rater pool of 630 raters balanced across 30 intersectional groups across age, gender, and ethnicity. Our study shows that (1) there are significant differences across demographic groups (including intersectional groups) on how severe they assess the harm to be, and that these differences vary across different types of safety violations, (2) the diverse rater pool captures annotation patterns that are substantially different from expert raters trained on specific set of safety policies, and (3) the differences we observe in T2I safety are distinct from previously documented group level differences in text-based safety tasks. To further understand these varying perspectives, we conduct a qualitative analysis of the open-ended explanations provided by raters. This analysis reveals core differences into the reasons why different groups perceive harms in T2I generations. Our findings underscore the critical need for incorporating diverse perspectives into safety evaluation of generative AI ensuring these systems are truly inclusive and reflect the values of all users.

Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups

TL;DR

A large-scale study employing highly-parallel safety ratings of about 1000 text-to-image (T2I) generations from a demographically diverse rater pool of 630 raters balanced across 30 intersectional groups across age, gender, and ethnicity shows significant differences across demographic groups on how severe they assess the harm to be.

Abstract

AI systems crucially rely on human ratings, but these ratings are often aggregated, obscuring the inherent diversity of perspectives in real-world phenomenon. This is particularly concerning when evaluating the safety of generative AI, where perceptions and associated harms can vary significantly across socio-cultural contexts. While recent research has studied the impact of demographic differences on annotating text, there is limited understanding of how these subjective variations affect multimodal safety in generative AI. To address this, we conduct a large-scale study employing highly-parallel safety ratings of about 1000 text-to-image (T2I) generations from a demographically diverse rater pool of 630 raters balanced across 30 intersectional groups across age, gender, and ethnicity. Our study shows that (1) there are significant differences across demographic groups (including intersectional groups) on how severe they assess the harm to be, and that these differences vary across different types of safety violations, (2) the diverse rater pool captures annotation patterns that are substantially different from expert raters trained on specific set of safety policies, and (3) the differences we observe in T2I safety are distinct from previously documented group level differences in text-based safety tasks. To further understand these varying perspectives, we conduct a qualitative analysis of the open-ended explanations provided by raters. This analysis reveals core differences into the reasons why different groups perceive harms in T2I generations. Our findings underscore the critical need for incorporating diverse perspectives into safety evaluation of generative AI ensuring these systems are truly inclusive and reflect the values of all users.

Paper Structure

This paper contains 37 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Figure shows the rate of disagreement and rate of occurrence at different thresholds for all diverse raters vs. experts raters across different violation types. The band around the plots gives the 95% confidence interval with sample size being $n$.
  • Figure 2: Figure shows the rate of disagreement and rate of occurrence at different thresholds for groups of diverse raters vs. expert raters, where grouping is by (a) age, (b) ethnicity, and (c) gender.
  • Figure 3:
  • Figure 4: Histograms for the frequency of the expert label and the plurality score per prompt-image pair. The average frequency of the expert label is 4.09, i.e., more than 4 experts gave the same annotation per prompt-image pair on average. The average frequency of the plurality score is 15.28, i.e., on average more than 15 raters gave the same score per prompt-image pair.
  • Figure 5: Figure shows the rate of disagreement and rate of occurrence at different thresholds for ethnic groups of diverse raters vs. expert raters on the three violation types (a) bias, (b) sexual, and (c) violent.
  • ...and 2 more figures