Decoding Safety Feedback from Diverse Raters: A Data-driven Lens on Responsiveness to Severity

Pushkar Mishra, Charvi Rastogi, Stephen R. Pfohl, Alicia Parrish, Tian Huey Teh, Roma Patel, Mark Diaz, Ding Wang, Michela Paganini, Vinodkumar Prabhakaran, Lora Aroyo, Verena Rieser

TL;DR

This work tackles pluralistic safety feedback by introducing a non-parametric framework to quantify how diverse raters express severity on ordinal scales. It defines two properties—stochastic ordering and discrimination—and develops Monotonic Precision Area ($MPA$) and Weighted Recall Area ($WRA$) as area-based metrics to assess responsiveness to severity, using a binary reference $U$ derived from either guideline-based $V^g$ or crowd-based $V^c$. The method is validated on two public datasets, revealing meaningful cross-demographic differences in severity perception and demonstrating how these metrics can guide rater selection, data collection, and alignment strategies for safer, multi-cultural AI systems. Compared to traditional metrics and parametric models, the proposed approach captures both broad and granular variations in pluralistic safety feedback, offering practical insights for robust AI alignment in diverse populations.

Abstract

Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for analyzing ordinal safety ratings in pluralistic settings. Specifically, we address the challenge of interpreting nuanced differences in safety feedback from a diverse population expressed via ordinal scales (e.g., a Likert scale). We define non-parametric responsiveness metrics that quantify how raters convey broader distinctions and granular variations in the severity of safety violations. Leveraging publicly available datasets of pluralistic safety feedback as our case studies, we investigate how raters from different demographic groups use an ordinal scale to express their perceptions of the severity of violations. We apply our metrics across violation types, demonstrating their utility in extracting nuanced insights that are crucial for aligning AI systems reliably in multi-cultural contexts. We show that our approach can inform rater selection and feedback interpretation by capturing nuanced viewpoints across different demographic groups, hence improving the quality of pluralistic data collection and in turn contributing to more robust AI alignment.

Paper Structure

This paper contains 28 sections, 15 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Monotonic precision area (MPA), weighted recall area (WRA), their harmonic mean (HM), Kendall's $\tau$, and AUROC for trisectional demographic groups of crowd raters when binary reference $U$ is obtained from expert raters. All confidence intervals are within $\pm 0.01$. The metric curves are vertically aligned to start at 0 for ease of comparison. Legend gives the values by which the curves are translated.
  • Figure 2: Monotonic precision area (MPA), weighted recall area (WRA), their harmonic mean (HM), Kendall's $\tau$, AUROC, Mokken H, and IRT discrimination $\alpha$ for trisectional demographic groups of crowd raters when binary reference $U$ is obtained from crowd raters, excluding the group being evaluated. All confidence intervals are within $\pm 0.01$. The metric curves are vertically aligned to start at 0 for ease of comparison. Legend gives the values by which the curves are translated.
  • Figure 3: Monotonic precision area (MPA), weighted recall area (WRA), their harmonic mean (HM), Kendall's $\tau$, AUROC, and IRT discrimination $\alpha$ for trisectional groups of diverse raters. All confidence intervals are within $\pm 0.01$. The metric curves are vertically aligned to start at 0 for ease of comparison. Legend gives the values by which the curves are translated.
  • Figure 4: Some possible distributions of true severity $V^j$ as perceived by a rater $j$ against true severity $V^g$ or $V^c$ as captured by the reference for cases where (a) both MPA and WRA of rater $j$ are low, (b) MPA of rater $j$ is low, and (c) WRA of rater $j$ is low. Red dots represent items with $U = 1$ and blue dots represent items with $U = 0$. Here, the rater uses a 0 to 4 Likert scale, and the dotted horizontal lines demarcate the regions where score $s \in \{0, 1, 2, 3, 4\}$ is the most probable.
  • Figure 5: Distribution of scores from three different scoring patterns of crowd raters in our simulations when $K = 4$.
  • ...and 7 more figures
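
The figure captions above compare MPA and WRA against traditional metrics such as Kendall's $\tau$ and AUROC, and report a harmonic mean (HM) of MPA and WRA. The following is a minimal sketch of those baseline quantities only, assuming one rater's ordinal scores and a binary reference $U$ over the same items. It does not implement MPA or WRA, whose definitions are not given in this summary. The function names, the tie handling (tau-b for Kendall's $\tau$, half credit for tied pairs in AUROC), and the sample data are illustrative assumptions rather than the authors' code.

```python
from itertools import combinations
from math import sqrt


def auroc(scores, reference):
    """AUROC of ordinal scores against a binary reference U.

    Computed as the fraction of (U=1, U=0) item pairs in which the U=1 item
    receives the higher score; tied pairs count as one half (an assumption).
    """
    pos = [s for s, u in zip(scores, reference) if u == 1]
    neg = [s for s, u in zip(scores, reference) if u == 0]
    if not pos or not neg:
        raise ValueError("reference must contain both U=0 and U=1 items")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = tied_x = tied_y = 0
    pairs = list(combinations(zip(x, y), 2))
    for (x1, y1), (x2, y2) in pairs:
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    denom = sqrt((len(pairs) - tied_x) * (len(pairs) - tied_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative values, e.g. an MPA and a WRA."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


if __name__ == "__main__":
    # Hypothetical ratings on a 0-4 Likert scale and a hypothetical reference U.
    scores = [0, 1, 1, 3, 4, 2]
    reference = [0, 0, 1, 1, 1, 0]
    print("AUROC:", auroc(scores, reference))
    print("Kendall tau-b:", kendall_tau_b(scores, reference))
    # The two inputs here are placeholders, not real MPA/WRA values.
    print("HM:", harmonic_mean(0.5, 1.0))
```

Because a binary reference produces many tied pairs, tau-b stays below 1 even when a rater orders every $U = 1$ item above every $U = 0$ item, whereas AUROC reaches 1 in that case.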