Culture Cartography: Mapping the Landscape of Cultural Knowledge
Caleb Ziems, William Held, Jane Yu, Amir Goldberg, David Grusky, Diyi Yang
TL;DR
CultureCartography is a mixed-initiative annotation methodology that pairs LLM-driven question generation with direct human edits to identify culture-specific knowledge gaps; it is implemented as the CultureExplorer tool. In studies covering Nigeria and Indonesia, the approach yields data that leading models find harder to recall and that web search does not easily surface, which supports the claim that the data are "Google-proof". Transfer experiments show that fine-tuning small to mid-size models on CultureCartography data improves downstream performance on culture benchmarks more than fine-tuning on traditionally annotated data, reducing reliance on web-based knowledge. The work demonstrates the value of participatory, multilingual knowledge elicitation for building culturally aware NLP systems, and it notes ethical considerations, biases, and questions of scope beyond knowledge representation.
Abstract
To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher's goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.
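The initialization step described above, where the LLM starts annotation from questions it answers with low confidence, can be illustrated with a minimal sketch. The abstract does not say how confidence is measured, so this example assumes agreement among repeatedly sampled answers as a stand-in. The function names, the threshold, and the placeholder question keys are all hypothetical and are not taken from CultureExplorer.

```python
from collections import Counter


def answer_confidence(sampled_answers):
    """Return the fraction of samples that agree with the most common answer.

    ASSUMPTION: agreement across repeated samples is used as a proxy for
    model confidence; the paper may measure confidence differently.
    """
    if not sampled_answers:
        return 0.0  # no answer at all is treated as zero confidence
    normalized = [a.strip().lower() for a in sampled_answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def select_low_confidence(candidates, threshold=0.6):
    """Keep questions whose confidence is below `threshold`, least confident first.

    `candidates` maps each question to a list of answers sampled from the model.
    The threshold value is a hypothetical choice for illustration.
    """
    scored = [(q, answer_confidence(answers)) for q, answers in candidates.items()]
    low = [(q, c) for q, c in scored if c < threshold]
    return sorted(low, key=lambda pair: pair[1])


# Placeholder data: keys stand in for questions, values for sampled model answers.
candidates = {
    "q_known": ["A", "a", "A ", "a"],  # full agreement after normalization
    "q_gap": ["x", "y", "z", "x"],     # answers disagree
    "q_empty": [],                     # model produced no answer
}

for question, confidence in select_low_confidence(candidates):
    print(f"{question}: confidence={confidence:.2f}")
```

In this sketch, the selected questions would be the ones shown to a human respondent, who could then answer them or edit them towards more salient topics.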
