Decoding Safety Feedback from Diverse Raters: A Data-driven Lens on Responsiveness to Severity
Pushkar Mishra, Charvi Rastogi, Stephen R. Pfohl, Alicia Parrish, Tian Huey Teh, Roma Patel, Mark Diaz, Ding Wang, Michela Paganini, Vinodkumar Prabhakaran, Lora Aroyo, Verena Rieser
TL;DR
This work addresses pluralistic safety feedback by introducing a non-parametric framework that quantifies how diverse raters express severity on ordinal scales. It defines two properties, stochastic ordering and discrimination, and develops two area-based metrics, Monotonic Precision Area ($MPA$) and Weighted Recall Area ($WRA$), that assess responsiveness to severity against a binary reference $U$ derived from either a guideline-based $V^g$ or a crowd-based $V^c$. The method is validated on two public datasets, where it reveals meaningful cross-demographic differences in severity perception and shows how the metrics can guide rater selection, data collection, and alignment strategies for safer, multi-cultural AI systems. Compared to traditional metrics and parametric models, the proposed approach captures both broad and granular variations in pluralistic safety feedback, offering practical insights for robust AI alignment in diverse populations.
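To make the stochastic ordering property more concrete, the following is a minimal Python sketch of one common reading of it: a rater group's ordinal ratings for items with $U=1$ should first-order stochastically dominate its ratings for items with $U=0$. This is an illustration under stated assumptions, not the paper's definition and not an implementation of $MPA$ or $WRA$. The function names, the 1-to-5 scale, the convention that higher ratings mean greater severity, and the toy ratings are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def empirical_cdf(ratings: Sequence[int], scale: Sequence[int]) -> list[float]:
    """Share of ratings at or below each scale point (scale sorted ascending)."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    counts = Counter(ratings)
    total = len(ratings)
    cdf, running = [], 0
    for point in scale:
        running += counts[point]
        cdf.append(running / total)
    return cdf


def is_stochastically_ordered(
    ratings_u1: Sequence[int],
    ratings_u0: Sequence[int],
    scale: Sequence[int],
) -> bool:
    """Check first-order stochastic dominance of U=1 ratings over U=0 ratings.

    Assumes higher scale points mean greater severity. Dominance holds when
    the U=1 CDF never exceeds the U=0 CDF at any scale point.
    """
    cdf_u1 = empirical_cdf(ratings_u1, scale)
    cdf_u0 = empirical_cdf(ratings_u0, scale)
    tolerance = 1e-12  # guard against floating-point noise
    return all(hi <= lo + tolerance for hi, lo in zip(cdf_u1, cdf_u0))


# Hypothetical toy data on an assumed 1-5 severity scale.
SCALE = [1, 2, 3, 4, 5]
violating = [3, 4, 4, 5, 5]       # ratings on items where U = 1
non_violating = [1, 1, 2, 2, 3]   # ratings on items where U = 0

ordered = is_stochastically_ordered(violating, non_violating, SCALE)
reversed_order = is_stochastically_ordered(non_violating, violating, SCALE)
```

With this toy data, `ordered` is `True` and `reversed_order` is `False`: the group places violating items higher on the scale at every cut point. A check like this is only a yes/no condition; the area-based metrics named above are described as capturing how strongly, and at what granularity, raters respond to severity.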
Abstract
Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for analyzing ordinal safety ratings in pluralistic settings. Specifically, we address the challenge of interpreting fine-grained differences in safety feedback that a diverse population expresses via ordinal scales (e.g., a Likert scale). We define non-parametric responsiveness metrics that quantify how raters convey both broad distinctions and granular variations in the severity of safety violations. Using publicly available datasets of pluralistic safety feedback as case studies, we investigate how raters from different demographic groups use an ordinal scale to express their perceptions of the severity of violations. We apply our metrics across violation types and demonstrate their utility in extracting insights that are crucial for aligning AI systems reliably in multi-cultural contexts. We show that our approach can inform rater selection and feedback interpretation by capturing differences in viewpoint across demographic groups, thereby improving the quality of pluralistic data collection and, in turn, contributing to more robust AI alignment.
