SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes

Mukul Bhutani, Kevin Robinson, Vinodkumar Prabhakaran, Shachi Dave, Sunipa Dev

TL;DR

This work tackles the gap in multilingual safety and fairness evaluations for generative models by introducing SeeGULL Multilingual ($SGM$), a global-scale dataset of 25,861 stereotypes spanning 20 languages and 23 regions. The authors combine LLM-assisted generation with culturally situated human annotations to create a rich resource of stereotype identities and attributes, including mean offensiveness scores, enabling cross-cultural model evaluation and gap analysis relative to English-centric resources ($SGE$). They provide a thorough methodology for identifying identity terms, generating associations, and obtaining region-specific annotations, plus extensive dataset statistics, overlap analyses, and gendered-demonyms analysis. The results demonstrate substantial cross-language, cross-region variation in stereotypes and offensiveness, and show that current foundation models endorse stereotypes differently across languages, underscoring the need for multilingual safety benchmarks with geo-cultural grounding.

Abstract

While generative multilingual models are rapidly being deployed, their safety and fairness evaluations are largely limited to resources collected in English. This is especially problematic for evaluations targeting inherently socio-cultural phenomena such as stereotyping, where it is important to build multilingual resources that reflect the stereotypes prevalent in the respective language communities. However, gathering these resources at scale, in varied languages and regions, poses a significant challenge, as it requires broad socio-cultural knowledge and can also be prohibitively expensive. To overcome this critical gap, we employ a recently introduced approach that couples LLM generations for scale with culturally situated validations for reliability, and build SeeGULL Multilingual, a global-scale multilingual dataset of social stereotypes, containing over 25K stereotypes, spanning 20 languages, with human annotations across 23 regions, and demonstrate its utility in identifying gaps in model evaluations. Content warning: Stereotypes shared in this paper can be offensive.

Paper Structure (33 sections, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Examples from SeeGULL Multilingual. Lang. (Language): es: Spanish, fr: French, it: Italian, sw: Swahili; S: # of annotators (out of 3) who reported it as a stereotype; O: mean offensiveness rating of the stereotype.
  • Figure 2: Example differences in known stereotypes in the same language across two different countries. S($C_i$) is the number of annotators marking the tuple as a stereotype in country $C_i$. Countries are denoted by their ISO codes.
  • Figure 3: Offensiveness annotations for nationalities of the world. We take all the stereotypes along the nationality axis and compute, for each country, the average of their mean offensiveness scores. Countries shown in darker shades of red have, on average, more offensive stereotypes associated with them.
  • Figure 4: Example of highly offensive stereotypes. The column country denotes the country of annotation.
  • Figure 5: Example of evaluation prompt in Bengali and English translation. The stereotypical identity associated with the blue attribute is highlighted in orange.
  • ...and 1 more figure
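
The per-country aggregation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the row layout, the placeholder country and attribute names, and the score values are all hypothetical, and the actual dataset schema and offensiveness scale may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (identity_country, attribute, mean_offensiveness).
# Names and scores are placeholders, not values from the dataset.
rows = [
    ("Country A", "attribute_1", 2.0),
    ("Country A", "attribute_2", 1.0),
    ("Country B", "attribute_3", 0.0),
    ("Country B", "attribute_4", 1.0),
    ("Country B", "attribute_5", 2.0),
]


def average_offensiveness_by_country(rows):
    """Average the per-stereotype mean offensiveness scores for each country."""
    scores = defaultdict(list)
    for country, _attribute, offensiveness in rows:
        scores[country].append(offensiveness)
    return {country: mean(values) for country, values in scores.items()}


country_scores = average_offensiveness_by_country(rows)
print(country_scores)
```

Each resulting value is an average of averages: every stereotype already carries a mean over its annotators, and the function averages those means per country, which is the quantity the caption describes as being mapped to shades of red.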