Culture Cartography: Mapping the Landscape of Cultural Knowledge

Caleb Ziems, William Held, Jane Yu, Amir Goldberg, David Grusky, Diyi Yang

TL;DR

Culture Cartography introduces a mixed-initiative framework that pairs LLM-driven question generation with human edits to identify culture-specific knowledge gaps, implemented via the Culture Explorer tool. Across Nigeria and Indonesia, this approach yields data that are more challenging for leading models to recall and are not easily surfaced by web search, which supports the claim that the data are "Google-proof". Transfer experiments show that fine-tuning small to mid-size models on Culture Cartography data improves downstream performance on culture benchmarks more than fine-tuning on traditionally annotated data, reducing reliance on web-based knowledge. The work demonstrates the value of participatory, multilingual knowledge elicitation for building culturally aware NLP systems and highlights ethical considerations, biases, and the scope beyond mere knowledge representation.

Abstract

To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher's goals. We propose a mixed-initiative methodology called Culture Cartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called Culture Explorer. Compared to a baseline where humans answer LLM-proposed questions, we find that Culture Explorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.
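
As a rough illustration of the loop described above, the following Python sketch keeps only the candidate questions a model is unsure about and then lets a human edit one of them. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Node`, `propose_low_confidence`, and `human_edit`, the numeric confidence score, the 0.5 threshold, and the choice to discard stale children after an edit are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node in a hypothetical annotation tree (topic, question, or answer)."""
    kind: str
    text: str
    confidence: Optional[float] = None  # assumed model self-confidence in [0, 1]
    edited_by_human: bool = False
    children: List["Node"] = field(default_factory=list)


def propose_low_confidence(
    candidates: List[Tuple[str, float]], threshold: float = 0.5
) -> List[Node]:
    """Keep only the questions whose assumed answer confidence is below the threshold."""
    return [
        Node("question", text, confidence=conf)
        for text, conf in candidates
        if conf < threshold
    ]


def human_edit(node: Node, new_text: str) -> None:
    """Apply a direct human edit and mark the node as human-steered."""
    node.text = new_text
    node.edited_by_human = True
    # Assumption: generations made for the old wording are stale, so drop them.
    node.children.clear()


# Toy usage with made-up questions and confidence scores.
seed = Node("topic", "gifts")
seed.children = propose_low_confidence(
    [
        ("What gifts are common at weddings?", 0.9),
        ("Which gifts are considered inappropriate for elders?", 0.2),
    ]
)
human_edit(seed.children[0], "Which gifts are inappropriate for elders during Eid?")
```

In this toy run only the low-confidence question survives the filter, and the human edit narrows it before any further generation would take place.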

Paper Structure

This paper contains 31 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Culture Cartography is a new method for identifying culturally salient knowledge gaps in LLMs. Prior methods are single-initiative: either a human determines the distribution (Knowledge Extraction), which may not challenge models, or an LLM decides on challenging questions (Traditional Annotation), which may not represent human interests. Culture Cartography is the first mixed-initiative method that combines four key ingredients: (1) an LLM proposes challenging questions; (2) a human proposes salient questions; (3) human edits constrain subsequent LLM generations; and (4) the data forms a tree structure. Compared to prior methods, Culture Cartography identifies more LLM knowledge gaps.
  • Figure 2: The Culture Explorer interface allows human experts to lead the annotation process, as they can Edit, Regenerate, or Delete nodes at any time. Cultural knowledge annotation is initiated with (A) a seed topic (here: gifts), which the LLM uses to generate (B) Question nodes. Here, the annotator is editing the first Question node to make it more specific to her Islamic culture. Each Question will serve as a seed for the LLM to generate (C) Answer nodes. The user can then pick the questions and answers that interest her, clarify them through edits, or write her own from scratch, iteratively expanding the tree with (D) deeper follow-up questions and answers.
  • Figure 3: Performance on Culture Cartography. Powerful models like DeepSeek R1 can entirely solve Synthetic Data ($R@100 \geq 98\%$), and also perform well on Traditional Annotation data ($R@100 \leq 92\%$). Most importantly, Culture Cartography data is appreciably harder than these single-initiative data sources, with moderate and statistically significant effect sizes (ns = "not significant"; $^{*}p < 0.05$; $^{**}p < 0.01$; $^{***}p < 0.001$; $^{****}p < 0.0001$) for both R1 ($d=0.17$ Indonesia; $d=0.29$ Nigeria) and GPT-4o ($d=0.20$ Indonesia; $d=0.32$ Nigeria).
  • Figure 4: Recall@K curves for DeepSeek R1, GPT-4o, and Qwen2-72B on Synthetic Data demonstrate that model performance either plateaus or reaches 100% by $K=100$.
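
The captions for Figures 3 and 4 report Recall@K ($R@100$) and an effect size $d$ without defining them here. The Python sketch below shows one plausible reading, offered as an assumption rather than the paper's exact procedure: Recall@K is taken as the fraction of human reference answers matched by at least one of the model's first K answers, and $d$ is taken as Cohen's d with a pooled standard deviation. The function names, the case-insensitive exact-match rule, and the toy data are all hypothetical.

```python
import math
from statistics import mean, variance
from typing import Callable, Sequence


def exact_match(reference: str, candidate: str) -> bool:
    """Assumed matching rule: case-insensitive string equality."""
    return reference.strip().lower() == candidate.strip().lower()


def recall_at_k(
    references: Sequence[str],
    model_answers: Sequence[str],
    k: int,
    match: Callable[[str, str], bool] = exact_match,
) -> float:
    """Fraction of references matched by any of the model's first k answers."""
    if not references:
        raise ValueError("references must not be empty")
    top_k = model_answers[:k]
    hits = sum(any(match(ref, cand) for cand in top_k) for ref in references)
    return hits / len(references)


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Toy usage with made-up answers and scores.
refs = ["batik", "sarong"]
answers = ["Batik", "flowers", "sarong"]
r_at_2 = recall_at_k(refs, answers, k=2)  # only "batik" is matched in the top 2
r_at_3 = recall_at_k(refs, answers, k=3)  # both references are matched in the top 3
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A real evaluation would likely need a softer matching rule than exact string equality, since free-text cultural answers rarely match verbatim; the `match` parameter is left pluggable for that reason.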