Do Large Language Models Reflect Demographic Pluralism in Safety?

Usman Naseem, Gautam Siddharth Kashyap, Sushant Kumar Ray, Rafiq Ali, Ebad Shabbir, Abdullah Mohammad

TL;DR

This work tackles the challenge of demographic pluralism in LLM safety by introducing Demo-SafetyBench, a two-stage, prompt-level framework that isolates demographic variation from model responses. Stage I reclassifies DICES prompts into 14 BeaverTails-derived safety domains, preserves demographic metadata, expands low-resource domains with conditional generation, and deduplicates the corpus to 43,050 samples. Stage II benchmarks pluralistic safety by evaluating prompts with zero-shot LLMs-as-Raters (Gemma-7B, GPT-4o, LLaMA-2-7B), yielding reliability and demographic-sensitivity metrics such as $\text{ICC}=0.87$ and $\text{DS}=0.12$, while revealing how model scale and alignment affect cross-demographic judgments. The results show that scalable, demographically robust evaluation is feasible, yet even strong models retain residual demographic sensitivity, highlighting the need for demographically aware alignment practices with practical implications for safety assessment and policy design. The approach provides a principled method to quantify safety perception across diverse populations, informing more inclusive and culturally aware AI safety standards.

Abstract

Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Yet, existing alignment datasets such as ANTHROPIC-HH and DICES rely on demographically narrow annotator pools, overlooking variation in safety perception across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BEAVERTAILS) using Mistral-7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference. Balanced thresholds ($\delta = 0.5$, $\tau = 10$) achieve high reliability ($\text{ICC} = 0.87$) and low demographic sensitivity ($\text{DS} = 0.12$), confirming that pluralistic safety evaluation can be both scalable and demographically robust.
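
The SimHash-based deduplication step named in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation: the word-level tokenization, the 64-bit MD5-derived token hash, and the function names `simhash`, `hamming`, and `deduplicate` are all assumptions made for illustration. Treating the threshold `tau = 10` as a Hamming-distance cutoff is also a guess, since the abstract does not define how it is applied.

```python
import hashlib
import re

BITS = 64  # fingerprint width (an assumption; not stated in the source)


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over lowercase word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Stable 64-bit token hash: first 8 bytes of MD5 (not used for security).
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(prompts, tau=10):
    """Keep a prompt only if it is more than `tau` bits away from every kept prompt."""
    kept, fingerprints = [], []
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming(fp, other) > tau for other in fingerprints):
            kept.append(prompt)
            fingerprints.append(fp)
    return kept
```

The pairwise comparison here is quadratic in the number of kept prompts, which is fine for a demonstration but would likely need bucketing of fingerprints on a corpus of tens of thousands of samples.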

Paper Structure

This paper contains 16 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Safety taxonomy used in Demo-SafetyBench. We adapt the BeaverTails (Ji et al., 2023) taxonomy into fourteen text-safety domains that ground the Stage I reclassification of DICES (NeurIPS 2023) prompts.
  • Figure 2: Overview of the Demo-SafetyBench pipeline. The framework comprises two stages: Stage I constructs a demographically diversified, prompt-level corpus by reclassifying and expanding DICES queries across 14 safety domains using Mistral-7B; Stage II benchmarks pluralistic safety by evaluating these prompts with LLMs-as-Raters (Gemma-7B, GPT-4o, LLaMA-2-7B) under zero-shot inference.
  • Figure 3: Multi-label classification in Stage I using the Demo-SafetyBench taxonomy. Mistral-7B-Instruct-v0.3 predicts per-domain probabilities; labels above $\delta{=}0.5$ are selected, enabling multi-domain assignment when appropriate.
  • Figure 4: Conditional query generation in Stage I for low-resource domains. Each synthetic query $q'_k$ is generated using Llama-3.1-8B-Instruct, conditioned on both the safety domain label $y_j$ and the sampled demographic prior $\mathbf{d} \sim p_{\text{demo}}(\mathbf{d})$. This preserves proportional demographic representation across categories while expanding under-represented domains.
  • Figure 5: Evaluation protocol in Stage II. Each prompt–demographic pair $(q_i, \mathbf{d}_i)$ is formatted into a structured input $\mathbf{x}_i$ and passed to a rater model $f_m$. The model outputs both a categorical label and a self-calibrated numerical confidence score, $s_{i,m} \in [0,1]$, representing its intrinsic assessment of safety. This unified dual-output schema enables consistent, interpretable pluralistic evaluation across all raters. To ensure comparability across raters, all raw confidence values returned by the models are normalized to the range $[0,1]$ via direct numeric parsing of the model outputs, without temperature or logit scaling.
  • ...and 2 more figures
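
The thresholding described in the Figure 3 caption and the dual-output parsing described in the Figure 5 caption can be sketched in Python as follows. This is a hedged illustration only: the domain names, the `label: ..., confidence: ...` reply format, and the function names `select_domains` and `parse_rating` are hypothetical, because the captions do not specify the exact prompt or output template. The only values taken from the source are the threshold $\delta = 0.5$ and the requirement that confidence scores lie in $[0,1]$.

```python
import re
from typing import Dict, List, Optional, Tuple

DELTA = 0.5  # label threshold quoted in the Figure 3 caption

# Hypothetical reply format: "label: <safe|unsafe>, confidence: <number>".
_RATING = re.compile(
    r"label\s*:\s*(safe|unsafe)\b.*?confidence\s*:\s*([0-9]*\.?[0-9]+)",
    re.IGNORECASE | re.DOTALL,
)


def select_domains(probs: Dict[str, float], delta: float = DELTA) -> List[str]:
    """Return every domain whose predicted probability is strictly above delta."""
    return sorted(domain for domain, p in probs.items() if p > delta)


def parse_rating(raw: str) -> Optional[Tuple[str, float]]:
    """Extract (label, confidence) by direct numeric parsing of a rater reply.

    Returns None when the reply does not match the assumed format or when
    the parsed confidence falls outside [0, 1]; no rescaling is applied.
    """
    match = _RATING.search(raw)
    if match is None:
        return None
    score = float(match.group(2))
    if not 0.0 <= score <= 1.0:
        return None
    return match.group(1).lower(), score


if __name__ == "__main__":
    # Illustrative probabilities, not values from the paper.
    print(select_domains({"hate_speech": 0.81, "privacy": 0.5, "violence": 0.64}))
    print(parse_rating("Label: Unsafe, Confidence: 0.83"))
```

Rejecting out-of-range or unparseable replies, rather than clipping them, is one possible design choice; it keeps malformed rater outputs from silently entering downstream reliability statistics.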