Building Knowledge-Guided Lexica to Model Cultural Variation

Shreya Havaldar, Salvatore Giorgi, Sunny Rai, Young-Min Cho, Thomas Talhelm, Sharath Chandra Guntuku, Lyle Ungar

TL;DR

This work introduces a new research problem for the NLP community, namely how to measure variation in cultural constructs across regions using language, and provides a scalable solution: building knowledge-guided lexica to model cultural variation. The aim is to encourage future work at the intersection of NLP and cultural understanding.

Abstract

Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs' failure to measure cultural variation or generate culturally varied language.

Paper Structure (38 sections, 5 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: We build knowledge-guided lexica to model cultural variation. Our method encodes domain knowledge via seed words based on cultural psychology theory. We use embeddings to transform these seed words into a high-validity lexical model that successfully measures cultural variation across the US.
  • Figure 2: Our knowledge-guided lexica creation method. We begin with a set of seed words curated by an expert psychologist. The first stage, Expansion, consists of synonym expansion and concept expansion, done in parallel. The second stage, Purification, includes frequency-based and correlation-based pruning, done sequentially.
  • Figure 3: Collectivism (red) and individualism (blue) across US counties. Dark red = higher collectivism and dark blue = higher individualism. We include 2042 counties with sufficient data to compute individualism/collectivism scores, along with 1095 counties with interpolated scores based on geographic and socio-demographic variables.
  • Figure 4: A comparison of collectivism (red) and individualism (blue) scores across communities defined by the American Communities Project, ordered from most individualistic (left) to least individualistic (right). We only analyze communities with over 40 included counties. Scores are 0-1 normalized.
  • Figure 5: Individualism score minus collectivism score for LLM-generated and real-world Tweets. Across four US states, Twitter data (green) aligns more closely with Vandello & Cohen's survey-based scores (yellow) than the GPT-3.5 data (purple) does.
  • ...and 2 more figures
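
The expand-then-purify pipeline described in the Figure 2 caption, together with the 0-1 normalization mentioned for Figure 4, can be illustrated with a small sketch. This is not the authors' implementation. The toy embeddings, the cosine threshold, the minimum count, and all function names are assumptions made for illustration. The sketch uses embedding similarity as a stand-in for the Expansion stage and covers only the frequency-based part of Purification; correlation-based pruning is omitted because this excerpt does not say what signal it correlates against.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(seeds, embeddings, threshold=0.8):
    """Expansion (assumed form): add words whose embedding is close to any seed."""
    known_seeds = [s for s in seeds if s in embeddings]
    lexicon = set(seeds)
    for word, vec in embeddings.items():
        if any(cosine(vec, embeddings[s]) >= threshold for s in known_seeds):
            lexicon.add(word)
    return lexicon


def prune_by_frequency(lexicon, docs, min_count=2):
    """Purification (frequency part only): drop words that are rare in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {w for w in lexicon if counts[w] >= min_count}


def score(doc, lexicon):
    """Fraction of a document's tokens that belong to the lexicon."""
    return sum(tok in lexicon for tok in doc) / len(doc) if doc else 0.0


def min_max(values):
    """Rescale scores to the 0-1 range; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


# Hypothetical 2-d embeddings and a single hypothetical seed word.
embeddings = {
    "family": (1.0, 0.1),
    "community": (0.9, 0.2),
    "together": (0.8, 0.3),
    "self": (0.1, 1.0),
    "pizza": (0.2, 0.9),
}
seeds = ["family"]

# Each token list stands in for the pooled text of one region.
docs = [
    ["family", "community", "dinner"],
    ["self", "pizza", "family"],
    ["community", "self", "self"],
]

expanded = expand(seeds, embeddings)
pruned = prune_by_frequency(expanded, docs)
normalized = min_max([score(doc, pruned) for doc in docs])

print(sorted(expanded))  # ['community', 'family', 'together']
print(sorted(pruned))    # ['community', 'family']
print(normalized)        # [1.0, 0.0, 0.0]
```

Under the same assumptions, a difference score like the one plotted in Figure 5 would come from building one lexicon per construct and subtracting the collectivism score from the individualism score for each region.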