Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S.

Christabel Acquaye, Haozhe An, Rachel Rudinger

Abstract

Recent work has highlighted the culturally contingent nature of commonsense knowledge. We introduce AMAMMER$ε$, a test set of 525 multiple-choice questions designed to evaluate the commonsense knowledge of English LLMs relative to the cultural contexts of Ghana and the United States. To create AMAMMER$ε$, we select a set of multiple-choice questions (MCQs) from existing commonsense datasets and rewrite them in a multi-stage process involving surveys of Ghanaian and U.S. participants. In three rounds of surveys, participants from both pools are asked to (1) write correct and incorrect answer choices, (2) rate individual answer choices on a 5-point Likert scale, and (3) select the best answer choice from the newly constructed MCQ items in a final validation step. By engaging participants at multiple stages, our procedure ensures that participant perspectives are incorporated in both the creation and the validation of test items, resulting in high levels of agreement within each pool. We evaluate several off-the-shelf English LLMs on AMAMMER$ε$. Uniformly, models prefer answer choices that align with the preferences of U.S. annotators over those of Ghanaian annotators. Additionally, when test items specify a cultural context (Ghana or the U.S.), models exhibit some ability to adapt, but performance is consistently better in U.S. contexts than in Ghanaian contexts. As large resources are devoted to the advancement of English LLMs, our findings underscore the need for culturally adaptable models and evaluations to meet the needs of diverse English-speaking populations around the world.
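The rating and construction steps described in the abstract can be illustrated with a minimal sketch. It assumes that the most and least favored answers in each pool are chosen by mean Likert rating; the abstract does not state the exact selection rule, so this is an assumption rather than the authors' procedure. The function names (`pick_extremes`, `build_mcq`), the question text, and all ratings are hypothetical and invented for illustration.

```python
from statistics import mean


def pick_extremes(ratings):
    """Return (most_favored, least_favored) answers by mean Likert rating.

    `ratings` maps each candidate answer to a list of 1-5 Likert scores.
    Using the mean is an assumption; the source does not specify the rule.
    """
    means = {answer: mean(scores) for answer, scores in ratings.items()}
    most_favored = max(means, key=means.get)
    least_favored = min(means, key=means.get)
    return most_favored, least_favored


def build_mcq(question, ratings_by_pool):
    """Assemble one MCQ from per-pool ratings (hypothetical helper).

    Each pool contributes its most favored answer (that pool's correct
    choice) and its least favored answer (a distractor). Duplicate
    options across pools are kept only once.
    """
    options = []
    answer_key = {}
    for pool, ratings in ratings_by_pool.items():
        most_favored, least_favored = pick_extremes(ratings)
        answer_key[pool] = most_favored
        for answer in (most_favored, least_favored):
            if answer not in options:
                options.append(answer)
    return {"question": question, "options": options, "answer_key": answer_key}


# Invented ratings for illustration only; not data from the paper.
example = build_mcq(
    "Where would a child keep spare coins?",
    {
        "GH": {"susu box": [5, 5, 4], "piggy bank": [3, 2, 3], "freezer": [1, 1, 2]},
        "US": {"susu box": [2, 1, 2], "piggy bank": [5, 4, 5], "freezer": [1, 2, 1]},
    },
)
print(example["options"])
print(example["answer_key"])
```

In this sketch a single item carries one answer key per pool, so the same model response can be scored against either the Ghanaian or the U.S. preference.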

Paper Structure

This paper contains 39 sections, 25 figures, 9 tables.

Figures (25)

  • Figure 2: Overall pipeline for our test set generation, shown from left to right. It starts with disambiguating sampled questions (§\ref{sec:disambiguation}) from our select dataset for unspecified (UN), Ghanaian (GH), and American (US) cultural settings for context (C). Participants provide free-form answers in the Generation Stage (§\ref{sec:generation}). Annotators rate these answers in the Likert Scale Answer Annotation task (§\ref{sec:likert}). The most and least favored answers for the two cultures are selected and used to construct the MCQs for the human baseline annotations (§\ref{sec:multiple}).
  • Figure 3: Selected Model Preference Distribution in GH Specified Settings. * represents values less than 5. See full results, including additional models, in Table \ref{tab:full_GH_US}.
  • Figure 4: Selected Model Preference Distribution in US Specified Settings. * represents values less than 5.1. See full results, including additional models, in Table \ref{tab:full_GH_US}.
  • Figure 5: Accuracy of models in Ghanaian settings when conditioned only on Ghanaian correct and distractor answers. See full results, including additional models, in Figure \ref{fig:gh_gh_only}.
  • Figure 6: Accuracy of models in US settings when conditioned only on US correct and distractor answers. * represents values less than 5. See full results, including additional models, in Figure \ref{fig:us_us_only}.
  • ...and 20 more figures