BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García, Hwaran Lee, Shamsuddeen Hassan Muhammad, Kiwoong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidhoum, Jose Camacho-Collados, Alice Oh

TL;DR

BLEnD is a multilingual, hand-crafted benchmark for evaluating LLMs on everyday cultural knowledge across 16 countries/regions and 13 languages. It covers six socio-cultural categories with 52.6k question-answer pairs in short-answer (SAQ) and multiple-choice (MCQ) formats. Because every country/region answers a common question set, annotated in its native language, the dataset supports direct cross-cultural comparison. The evaluation reveals substantial performance gaps for underrepresented cultures, particularly when models are prompted in low-resource local languages. Region- and language-specific models can outperform general models in their own contexts, and the overall results point to a need for more diverse, culturally representative training data. The dataset and evaluation framework are publicly available and are intended to guide the development of culturally sensitive multilingual LLMs.

Abstract

Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.

Paper Structure

This paper contains 39 sections, 2 equations, 15 figures, and 13 tables.

Figures (15)

  • Figure 1: The overall framework of dataset construction and LLM evaluation on BLEnD. BLEnD is built through 4 steps: question collection, question filtering & translation, answer annotation, and answer aggregation. The dataset includes the same questions in 13 different languages, answered from 16 different countries/regions. We evaluate LLMs by short-answer and multiple-choice questions.
  • Figure 2: Heatmap showing the average number of common lemmas within each question between all country/region pairs. Pairs from the same countries/regions are shown in white. Higher numbers of shared lemmas indicate that those countries/regions provide more similar answers compared to other countries/regions (e.g., Indonesia and West Java).
  • Figure 3: (a) LLMs' performance on short answer questions for each country/region in the local language. Models constructed from a Western country are shown in shades of blue, whereas those built from a non-Western country are shown in shades of red. (b) Average performance of all LLMs in local language and English on short answer questions. The grey error bars indicate the standard deviations among all models.
  • Figure 4: LLMs' performance on multiple-choice questions. Models constructed from a Western country are shown in shades of blue, whereas those built from a non-Western country are shown in shades of red. Similar to the results from short-answer questions, models tend to show lower performance in underrepresented countries/regions.
  • Figure 5: Example annotations for a cultural question related to the topic of food for each country/region in our dataset. The questions and annotations are provided in different languages, with translations of the annotated answers into English included in brackets. Annotations are sorted in descending order based on the frequency (i.e., vote count) of an answer provided by annotators, each separated by a line break. The vote count for each answer is displayed as numbers.
  • ...and 10 more figures
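
The captions for Figures 2 and 5 describe two simple computations: ranking each question's annotated answers by vote count, and averaging the number of answers two countries/regions share per question. The sketch below is a minimal illustration of those two ideas, not the authors' pipeline. The function names, the input layout, and the sample answers are all hypothetical, and lowercased answer strings stand in for the lemmas used in the paper.

```python
from collections import Counter
from itertools import combinations


def aggregate_answers(annotations):
    """Count votes per normalized answer, sorted by descending vote count.

    Normalization here is only strip + lowercase (an assumption; the paper
    compares lemmas). Ties are broken alphabetically for determinism.
    """
    counts = Counter(a.strip().lower() for a in annotations if a.strip())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def mean_shared_answers(answers_by_region):
    """Average number of shared answers per question for each region pair.

    answers_by_region: {region: {question_id: set of normalized answers}}
    Only questions answered by both regions in a pair are considered.
    """
    result = {}
    for a, b in combinations(sorted(answers_by_region), 2):
        common_questions = answers_by_region[a].keys() & answers_by_region[b].keys()
        if not common_questions:
            continue
        shared = [
            len(answers_by_region[a][q] & answers_by_region[b][q])
            for q in common_questions
        ]
        result[(a, b)] = sum(shared) / len(shared)
    return result


# Hypothetical annotations for one question from one region.
votes = aggregate_answers(["Cake", "cake ", "noodles", "Rice", "rice", "cake"])

# Hypothetical answer sets for two placeholder regions.
overlap = mean_shared_answers({
    "region_a": {"q1": {"rice", "tea"}, "q2": {"football"}},
    "region_b": {"q1": {"rice"}, "q2": {"cricket"}},
})
```

A pairwise table such as `overlap` is the kind of data a heatmap like Figure 2 would plot, with higher values indicating more similar answers between two countries/regions.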