DICES Dataset: Diversity in Conversational AI Evaluation for Safety

Lora Aroyo; Alex S. Taylor; Mark Diaz; Christopher M. Homan; Alicia Parrish; Greg Serapio-Garcia; Vinodkumar Prabhakaran; Ding Wang

TL;DR

The paper introduces DICES, a dataset for evaluating conversational AI safety that captures subjective safety judgments across demographic groups, with many replicated ratings per item and expert labels. It documents a five-step data collection process, two conversation corpora (DICES-990 and DICES-350), and extensive demographic and timing data that support analyses of safety, disagreement, and aggregation strategies. The work highlights substantial cross-demographic variation in safety opinions and questions the reliability of traditional gold labels, proposing DICES as a benchmark for studying and incorporating diverse perspectives in safety evaluation and model alignment. The dataset is thus a resource for exploring ambiguity, rater disagreement, and demographic intersections in safety assessments of language models.

Abstract

Machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This risks simplifying and even obscuring the inherent subjectivity present in many tasks. Preserving such variance in content and diversity in datasets is often expensive and laborious. This is especially troubling when building safety datasets for conversational AI systems, as safety is both socially and culturally situated. To demonstrate this crucial aspect of conversational AI safety, and to facilitate in-depth model performance analyses, we introduce the DICES (Diversity In Conversational AI Evaluation for Safety) dataset, which contains fine-grained demographic information about raters and high replication of ratings per item to ensure statistical power for analyses, and which encodes rater votes as distributions across different demographics to allow for in-depth explorations of different aggregation strategies. In short, the DICES dataset enables the observation and measurement of variance, ambiguity, and diversity in the context of conversational AI safety. We also illustrate how the dataset offers a basis for establishing metrics to show how raters' ratings can intersect with demographic categories such as racial/ethnic groups, age groups, and genders. The goal of DICES is to be used as a shared resource and benchmark that respects diverse perspectives during safety evaluation of conversational AI systems.
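
As a rough illustration of what encoding rater votes as distributions across demographics could look like, the sketch below groups toy ratings by item and rater group and compares the per-group distributions with a single majority label. The record layout, the group names, and the label values (`safe`, `unsafe`, `unsure`) are assumptions made for this example and are not taken from the DICES schema.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (item_id, rater_group, rating).
# Field names, group names, and label values are illustrative only.
ratings = [
    ("conv1", "group_a", "unsafe"),
    ("conv1", "group_a", "safe"),
    ("conv1", "group_b", "safe"),
    ("conv1", "group_b", "safe"),
    ("conv1", "group_b", "unsure"),
]


def vote_distributions(records):
    """Return {item: {group: {label: fraction of that group's votes}}}."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for item, group, label in records:
        counts[item][group][label] += 1

    result = {}
    for item, groups in counts.items():
        result[item] = {}
        for group, label_counts in groups.items():
            total = sum(label_counts.values())
            result[item][group] = {
                label: n / total for label, n in label_counts.items()
            }
    return result


def majority_label(records, item):
    """Single aggregated label for one item, ignoring rater groups."""
    label_counts = Counter(label for it, _, label in records if it == item)
    return label_counts.most_common(1)[0][0]


dists = vote_distributions(ratings)
overall = majority_label(ratings, "conv1")
```

In this toy case the majority label hides the fact that one group is evenly split, which is the kind of information a per-group distribution keeps and a single aggregated label discards.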

Paper Structure (16 sections, 5 figures, 2 tables)

Introduction
Contributions
Related Work
Data Collection Methodology
Corpus Creation
Sample Curation
Rater Pool Selection
Safety Annotation Task
Expert Annotation Task
DICES Dataset
Discussion and Limitations
Limitations
Statement of Ethics
Raters Consent Form
Raters Demographics Survey
...and 1 more section

Figures (5)

Figure 1: Demographic breakdown of annotators. Two illustrative plots of annotators by racial/ethnic groups and gender (left) and racial/ethnic groups and age groups (right).
Figure 2: Screenshot of the raters' user interface for the Safety Annotation Task: illustrates the annotation category for policy violations. The left panel presents the conversation; raters assess the last conversational turn (highlighted). The right panel presents two policy related sub-questions.
Figure 3: Breakdown of topics and degree of harm for DICES-350. Percentages of conversations per topic (left) and number of conversations per degree of harm (right).
Figure 4: Within-group agreement metrics, by race. IRR shows that Latine raters have significantly more agreement than other races. Negentropy (i.e. negative of entropy) and plurality size (i.e. the fraction of raters who choose the most popular response) show that White raters have significantly more, and Multiracial significantly less, agreement than other races.
Figure 5: Illustrative comparison between demographic sub-groups. The left graph shows rating counts for male and female annotators. The right graph shows counts for the 5 racial/ethnic groups.
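
The two per-item agreement measures named in the Figure 4 caption can be sketched as follows, using only the definitions given there: negentropy as the negative of entropy, and plurality size as the fraction of raters who choose the most popular response. The base-2 logarithm, the function names, and the toy labels are assumptions for this example; the paper's exact normalization is not stated in this summary.

```python
import math
from collections import Counter


def negentropy(labels):
    """Negative Shannon entropy (base 2, an assumption) of a list of labels.

    Values closer to 0 mean more agreement; more negative means less.
    """
    counts = Counter(labels)
    total = len(labels)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return -entropy


def plurality_size(labels):
    """Fraction of raters who chose the most popular response."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical ratings for one conversation from two rater groups.
unanimous = ["safe", "safe", "safe", "safe"]
split = ["safe", "safe", "unsafe", "unsafe"]

agreement = {
    "unanimous": (negentropy(unanimous), plurality_size(unanimous)),
    "split": (negentropy(split), plurality_size(split)),
}
```

Both functions expect a non-empty list of labels. Averaging them over items within each demographic group would give one way to compare within-group agreement, though the summary does not say how the paper aggregates across items.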
