Evaluating Cultural Awareness of LLMs for Yoruba, Malayalam, and English

Fiifi Dawson, Zainab Mosunmola, Sahil Pocker, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat

Abstract

Although LLMs have proven highly effective on a wide range of complex tasks, their understanding of regional languages and cultures is not well studied. In this paper, we explore the ability of various LLMs to comprehend the cultural aspects of two regional languages: Malayalam (state of Kerala, India) and Yoruba (West Africa). Using Hofstede's six cultural dimensions, Power Distance (PDI), Individualism (IDV), Motivation towards Achievement and Success (MAS), Uncertainty Avoidance (UAI), Long Term Orientation (LTO), and Indulgence (IVR), we quantify the cultural awareness of LLM-based responses. We demonstrate that although LLMs show high cultural similarity for English, they fail to capture the cultural nuances across these six dimensions for Malayalam and Yoruba. We also highlight the need for large-scale regional-language LLM training with culturally enriched datasets. This has significant implications for the user experience of chat-based LLMs and for the validity of large-scale LLM agent-based market research.

Paper Structure (20 sections, 3 equations, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: Large language models cannot uniformly capture cultural dimensions. (A) An example of how humans perceive hierarchy, which is captured by Hofstede's cultural dimension called the Power Distance Index (PDI). (B) Comparison of three cultural dimensions, Power Distance Index (PDI), Long-Term Orientation (LTO), and Individualism vs Collectivism (IDV), calculated using GPT-4o-mini vs ground truth in English.
  • Figure 2: The regions this study focuses on: the state of Kerala in southern India, where $\approx$38 million people speak Malayalam, and parts of West Africa, where $\approx$45 million people speak Yoruba. Both populations are much smaller than the 1.5 billion people who speak English. The figure also shows areas where English is spoken; these are marked with larger circles to illustrate that English speakers far outnumber speakers of regional languages like Malayalam and Yoruba.
  • Figure 3: Hofstede's Cultural Dimensions Theory is a framework for cross-cultural communication developed by Geert Hofstede. It describes the effects of a society's culture on the values of its members and how these values relate to behavior. The theory, in its current form, consists of six dimensions, as shown in the figure.
  • Figure 4: Pipeline for evaluating the cultural similarity score across the six cultural metrics (PDI, IDV, UAI, MAS, LTO, IVR) for the Yoruba and Malayalam languages.
  • Figure 5: Summary of Malayalam respondent demographics. The respondents span a wide range of (A) ages, (B) years of work experience, (C) educational qualifications, and (D) religions.
  • ...and 5 more figures