SafeWorld: Geo-Diverse Safety Alignment

Da Yin, Haoyi Qiu, Kung-Hsiang Huang, Kai-Wei Chang, Nanyun Peng

TL;DR

SafeWorld introduces a geo-diverse safety alignment benchmark and a multi-dimensional evaluation framework to assess LLMs on culturally and legally sensitive queries across 50 countries and 493 regions. It documents a data-collection pipeline (GeoSafeDB) and four query types, coupled with a DPO-based alignment method (SafeWorldAlign) to train SafeWorldLM, which outperforms GPT-4o across contextual appropriateness, accuracy, and comprehensiveness. The framework combines automatic metrics (faithfulness, coverage, factuality) with human evaluation to demonstrate improvements in geo-safety alignment while preserving general NLP and safety performance. The work provides a scalable path toward culturally aware AI that respects diverse norms and policies, with public release of code and data supporting broader adoption.

Abstract

In the rapidly evolving field of Large Language Models (LLMs), ensuring safety is a crucial and widely discussed topic. However, existing work often overlooks the geo-diversity of cultural and legal standards across the world. To demonstrate the challenges posed by geo-diverse safety standards, we introduce SafeWorld, a novel benchmark specifically designed to evaluate LLMs' ability to generate responses that are not only helpful but also culturally sensitive and legally compliant across diverse global contexts. SafeWorld comprises 2,342 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. Building on this benchmark, we propose a multi-dimensional automatic safety evaluation framework that assesses the contextual appropriateness, accuracy, and comprehensiveness of responses. Our evaluations reveal that current LLMs struggle to meet these criteria. To enhance LLMs' alignment with geo-diverse safety standards, we synthesize helpful preference pairs for Direct Preference Optimization (DPO) alignment training. The preference pair construction aims to encourage LLMs to behave appropriately and provide precise references to relevant cultural norms and policies when necessary. Our trained SafeWorldLM outperforms all competing models, including GPT-4o, on all three evaluation dimensions by a large margin. Global human evaluators also note a nearly 20% higher winning rate in helpfulness and harmfulness evaluation. Our code and data can be found here: https://github.com/PlusLabNLP/SafeWorld.
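
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair, where the "chosen" response would be the geo-appropriate one and the "rejected" response the less appropriate one. It is a minimal illustration of the general DPO objective, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example, and the inputs are assumed to be summed token log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) preference pair.

    All names and the default beta are hypothetical, chosen for illustration.
    Each *_logp argument is the log-probability of a full response.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of the two margins.
    z = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy has shifted probability toward the chosen response: loss drops.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls below log(2) when the policy favors the chosen response more strongly than the reference model does, and rises above it in the opposite case, which is the pressure that preference pairs exert during alignment training.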

Paper Structure

This paper contains 57 sections, 1 equation, 16 figures, and 23 tables.

Figures (16)

  • Figure 1: Examples of geo-diverse safety standards and the overall introduction of SafeWorld benchmark and its multi-dimensional evaluation.
  • Figure 2: The comparison between SafeWorld and other existing benchmarks.
  • Figure 3: Overview of the query generation pipeline. Based on GeoSafeDB, we generate four types of queries. We apply both machine and human validation to ensure high-quality generation.
  • Figure 4: SafeWorld query examples across four types. Some are paired with their corresponding reference (i.e., ground-truth) cultural-legal guidelines.
  • Figure 5: Overview of our multi-dimensional evaluation framework.
  • ...and 11 more figures