Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

Yannis Karmim; Renato Pino; Hernan Contreras; Hernan Lira; Sebastian Cifuentes; Simon Escoffier; Luis Martí; Djamé Seddah; Valentin Barrière

TL;DR

This work proposes to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science to create a dataset of question/answer (Q/A) pairs based on the different popular and social cultures of various Latin American countries.

Abstract

Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a region whose many cultures share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/A) pairs based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles and transformed into multiple-choice questions (MCQs) in Spanish and Portuguese, which are in turn translated to English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance across Latam countries, some being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam culture.
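
The per-country comparison mentioned in finding (i) amounts to grouping MCQ results by country and computing accuracy within each group. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the record fields `country`, `gold`, and `predicted` are hypothetical names chosen for illustration, and the records are made-up toy values rather than LatamQA data.

```python
from collections import defaultdict


def accuracy_by_country(records):
    """Return {country: fraction of MCQs answered correctly}.

    Each record is a dict with hypothetical keys:
    'country', 'gold' (correct option), 'predicted' (model's option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["country"]] += 1
        if rec["predicted"] == rec["gold"]:
            correct[rec["country"]] += 1
    return {country: correct[country] / total[country] for country in total}


# Toy, made-up records purely for illustration (not LatamQA data).
toy_records = [
    {"country": "Chile", "gold": "B", "predicted": "B"},
    {"country": "Chile", "gold": "A", "predicted": "C"},
    {"country": "Brazil", "gold": "D", "predicted": "D"},
]
scores = accuracy_by_country(toy_records)
```

The same grouping could be keyed on prompting language or cultural element instead of country to mirror the other comparisons listed in the abstract.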

Paper Structure

This paper contains 34 sections, 1 equation, 9 figures, and 4 tables.

Introduction and Related Work
Cultural Benchmark Creation
Language and Geographic Analysis
Our Approach
Benchmark Creation
Raw Wikipedia Data
Collection
Curation
Cultural Elements Distribution
Questions and Answers Generation
General Prompts
Questions Generation and Validation
Distractors Generations
Experiments and Results
Global Results and Prompting Language
...and 19 more sections

Figures (9)

Figure 1: Geographic distribution of LatamQA cultural MCQs across Latin America composed of 23k Q/As.
Figure 2: Distribution of the ratio of articles per cultural element per Language or Region in LatamQA. Cultural elements are: Anthroponyms (ANTHR), Forms of entertainment (ENTT), Local Institution (LOCAL), Toponyms (TOPO), Dialect (DIAL), Food and Drink (FOOD), Legal System (LEGAL), Scholastic reference (SCHOL), Religious celebration (RELIG), Fictional character (FICT).
Figure 3: Cross-country performance of Mistral models on LatamQA. Scaling from Small to Large yields consistent improvements (+5 to +8% accuracy).
Figure 4: Performance of Mistral-large in Latam Spanish and Portuguese with respect to the different cultural elements.
Figure D.1: General Cultural Exploration approach prompt.
...and 4 more figures
