Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens

Mai AlKhamissi, Yunze Xiao, Badr AlKhamissi, Mona Diab

TL;DR

This paper argues that current culture benchmarks in NLP reduce culture to static facts or homogeneous preferences, clashing with anthropological views of culture as dynamic and situated. It offers a four-part taxonomy—Culture-as-Knowledge, Culture-as-Preference, Culture-as-Dynamics, and Culture-as-Bias—and uses it to analyze 20 benchmarks, revealing six recurrent methodological issues. The authors propose concrete improvements, including real-world narratives, participatory design, contextual evaluation, and treating disagreement as a data signal, to build more nuanced benchmarks. By bridging social science with NLP practice, the work provides a roadmap for evaluating and mitigating cultural biases while capturing the lived, contested nature of culture in AI systems.

Abstract

Cultural evaluation of large language models has become increasingly important, yet current benchmarks often reduce culture to static facts or homogeneous values. This view conflicts with anthropological accounts that emphasize culture as dynamic, historically situated, and enacted in practice. To analyze this gap, we introduce a four-part framework that categorizes how benchmarks frame culture: as knowledge, preference, performance, or bias. Using this lens, we qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues, including treating countries as cultures, overlooking within-culture diversity, and relying on oversimplified survey formats. Drawing on established anthropological methods, we propose concrete improvements: incorporating real-world narratives and scenarios, involving cultural communities in design and validation, and evaluating models in context rather than isolation. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks and more accurately capture how models respond to complex cultural situations.

Paper Structure

This paper contains 32 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Cultural Framing in NLP. Our taxonomy of how culture is framed in NLP evaluation. Each quadrant represents a distinct theoretical lens on culture: defining what it entails, illustrating how it is expressed, and providing two representative benchmarks for each framing.
  • Figure 2: Mapping of 20 Cultural Benchmarks to the 4 Cultural Dimensions. We map the benchmarks analyzed in this work onto the proposed four-part taxonomy. Each benchmark inherits the dimension of its parent(s).