NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models

Abhinav Rao; Akhila Yerukola; Vishwa Shah; Katharina Reinecke; Maarten Sap

TL;DR

NormAd introduces a hierarchical framework to evaluate large language models' cultural adaptability beyond static knowledge, testing responses under RoT, Country, and Value+Country contexts. It operationalizes this framework with NormAd-Eti, a 2.6k-item dataset drawn from Cultural Atlas across 75 countries, augmented by rigorous automatic filtration and human validation. Findings show sizable gaps in adaptability, with models lagging humans particularly in Country and Value+Country scenarios, and a notable Western-centric bias that persists despite larger model sizes and advanced alignment methods. The work underscores the need for user-driven cultural context and more nuanced evaluation to ensure global applicability and safe, respectful cross-cultural interactions. It provides a practical benchmark and methodological blueprint for future work toward culturally aware AI systems serving diverse populations.

Abstract

To be effectively and safely deployed to global user populations, large language models (LLMs) may need to adapt outputs to user values and cultures, not just know about them. We introduce NormAd, an evaluation framework to assess LLMs' cultural adaptability, specifically measuring their ability to judge social acceptability across varying levels of cultural norm specificity, from abstract values to explicit social norms. As an instantiation of our framework, we create NormAd-Eti, a benchmark of 2.6k situational descriptions representing social-etiquette related cultural norms from 75 countries. Through comprehensive experiments on NormAd-Eti, we find that LLMs struggle to accurately judge social acceptability across these varying degrees of cultural contexts and show stronger adaptability to English-centric cultures over those from the Global South. Even in the simplest setting where the relevant social norms are provided, the best LLMs' performance (< 82%) lags behind humans (> 95%). In settings with abstract values and country information, model performance drops substantially (< 60%), while human accuracy remains high (> 90%). Furthermore, we find that models are better at recognizing socially acceptable situations than unacceptable ones. Our findings showcase the current pitfalls in socio-cultural reasoning of LLMs which hinder their adaptability for global audiences.
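The three-context evaluation described above can be illustrated with a minimal sketch. This is not the authors' code: the field names (`rot`, `country`, `value`, `situation`, `label`), the prompt wording, the `yes`/`no` label strings, and the `judge` callable are all assumptions introduced for illustration, and the two toy items are invented placeholders rather than NormAd-Eti data.

```python
from collections import defaultdict

# Hypothetical context names mirroring the RoT, Country, and Value+Country settings.
CONTEXTS = ("rot", "country", "value_country")


def build_prompt(item, context):
    """Prefix a situation with the cultural context for one setting (assumed format)."""
    if context == "rot":
        prefix = f"Rule of thumb: {item['rot']}"
    elif context == "country":
        prefix = f"Country: {item['country']}"
    elif context == "value_country":
        prefix = f"Country: {item['country']}\nValue: {item['value']}"
    else:
        raise ValueError(f"unknown context: {context}")
    return f"{prefix}\nSituation: {item['situation']}\nIs this socially acceptable?"


def accuracy_by_context(items, judge):
    """Return the fraction of items judged correctly under each context."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = defaultdict(int)
    for item in items:
        for context in CONTEXTS:
            prediction = judge(build_prompt(item, context))
            correct[context] += int(prediction == item["label"])
    return {context: correct[context] / len(items) for context in CONTEXTS}


# Invented toy items; not taken from the dataset.
toy_items = [
    {
        "rot": "It is not polite to start eating before the host.",
        "country": "Country A",
        "value": "Respect for hosts",
        "situation": "Sam started eating before the host sat down.",
        "label": "no",
    },
    {
        "rot": "It is polite to bring a small gift when visiting.",
        "country": "Country B",
        "value": "Gratitude toward hosts",
        "situation": "Ana brought a small gift to her friend's home.",
        "label": "yes",
    },
]


def toy_judge(prompt):
    """Stand-in for a model: only reacts to an explicit, negated rule of thumb."""
    if prompt.startswith("Rule of thumb") and " not " in prompt:
        return "no"
    return "yes"


scores = accuracy_by_context(toy_items, toy_judge)
```

With this stand-in judge, accuracy is highest in the RoT setting and lower in the other two, which loosely mirrors the shape of the reported gap; in practice `judge` would wrap a call to the model under test.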

Paper Structure (60 sections, 19 figures, 7 tables)

Introduction
Related work
Culture in LLMs
On Value Pluralism and Personalization of LLMs
NormAd Evaluation Framework
Rule-of-Thumb (RoT)
Country
Value+Country
NormAd-Eti Construction
Social Situation Description
Norm Sourcing
Social Situation Labels
Transforming Norms into Social Situation Descriptions
Automatic Filtration
Check 1: Entailment of RoT to Cultural Atlas's norms
...and 45 more sections

Figures (19)

Figure 1: We introduce NormAd, a framework for testing a language model's ability to adapt its responses when contextualized with varying levels of cultural information specificity, in contrast to prior methods that directly probe models for their knowledge. We show that LMs struggle to pick up cultural cues when provided with varying levels of context (Xs represent their incorrect responses), unlike humans, who can generally recognize such cues.
Figure 2: Our NormAd-Eti construction pipeline consists of four parts: a) Generation: We source social etiquette-related social norms from Cultural Atlas and systematically transform them into a grounded social situation description, RoT, and Value; b) Filtration: We perform three rounds of automatic filtering and sanity checks to eliminate inconsistencies; c) Validation: We conduct extensive human validation of the constructed dataset; d) Human Performance: We conduct a small-scale assessment of human performance.
Figure 3: Comparison of accuracies across LLaMa-1-SFT (7b, 13b, 30b), LLaMa-2 (7b, 13b, 70b), OLMo7b (SFT/Chat), GPT-3.5-turbo, GPT-4, and Mistral across all three contexts. Models perform significantly worse in Country and Country+Value contexts compared to the RoT context. Human performance for the Country and Country+Value contexts is reported as a green dashed line. Baseline performance (no context) is reported in Appendix \ref{sec:app:scores_for_all_models} and \ref{sec:app::subsec:acc_all}.
Figure 4: Comparison of model accuracies under Country + Value shows a notable performance skew, with top models (with increased size or improved preference alignment methods) performing better in social situations from English-speaking countries than in African-Islamic cultural regions.
Figure 5: Effect of preference alignment over the accuracies of LLaMa-1 models, against the RoT context. KTO improves performance significantly for 30b parameter models, with lesser improvement for 7b models.
...and 14 more figures
