SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context

Aishwarya Verma; Laud Ammah; Olivia Nercy Ndlovu Lucas; Andrew Zaldivar; Vinodkumar Prabhakaran; Sunipa Dev

TL;DR

This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa, and establishes a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality.

Abstract

Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.

Paper Structure (26 sections, 3 figures, 11 tables)

Introduction
Data Collection Methodology
Survey Mode
Local Partnerships
Interview Protocol
Data Capture & Localization
Notes on Retention of Nuance
Data Preparation & Annotation
The SAFARI Dataset
Dataset Characteristics
Analysis and Evaluations
Common Stereotype Attributes
Prevalence of Stereotypes in Generative AI
Discussion and Future Work
Dataset Details
...and 11 more sections

Figures (3)

Figure 1: Our community-engaged methodology towards curating the multilingual SAFARI dataset, comprising 3,206 stereotypes across four countries and 15 languages. See Table \ref{tab:multilingualstats} for the language distribution.
Figure 2: Country-wise distribution of dominant identity axes in the SAFARI stereotype dataset.
Figure 3: The SAFARI dataset represents 15 native languages from 4 countries. Table \ref{tab:multilingualstats} provides the expanded ISO codes.
