Evaluating Cultural Adaptability of a Large Language Model via Simulation of Synthetic Personas

Louis Kwok, Michal Bravansky, Lewis D. Griffin

TL;DR

Problem: evaluating cultural adaptability of LLMs. Approach: replicate a multinational persuasion experiment by simulating personas with nationality cues using GPT-3.5 and compare to human data. Findings: country-of-residence prompts improve alignment, while native-language prompting can degrade fidelity; Greek/Hebrew prompts are particularly detrimental. Significance: informs prompt design for culturally aware AI and motivates broader benchmarking across models.

Abstract

The success of Large Language Models (LLMs) in multicultural environments hinges on their ability to understand users' diverse cultural backgrounds. We measure this capability by having an LLM simulate human profiles representing various nationalities within the scope of a questionnaire-style psychological experiment. Specifically, we employ GPT-3.5 to reproduce the reactions of 7,286 participants from 15 countries to persuasive news articles, and compare the results with a dataset of real participants sharing the same demographic traits. Our analysis shows that specifying a person's country of residence improves GPT-3.5's alignment with their responses. In contrast, native-language prompting introduces shifts that significantly reduce overall alignment, with some languages particularly impairing performance. These findings suggest that while direct nationality information enhances the model's cultural adaptability, native-language cues do not reliably improve simulation fidelity and can detract from the model's effectiveness.

Paper Structure (22 sections, 2 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: A specific human profile is defined, enriched with nationality or language, and evaluated against the ground-truth results from Bos et al. (2020).
  • Figure 2: Format of a sample prompt used in the GPT-3.5 simulation. The prompt is intended to read like a semi-complete questionnaire, with the final numeric response (highlighted) provided by GPT-3.5. Key sections of the prompt are indicated by letters. a) Demographic information of the simulated participant. b) Relative deprivation ratings of the simulated participant in response to probe statements. c) The version of the news article shown to the simulated participant. In this example the anti-elite, anti-immigrant version is shown. d) The final instruction and a probe statement for GPT-3.5 to provide a single numerical response to.
  • Figure 3: Sign agreement rates for monolingual prompting in 12 different languages. Agreement rates significantly ($p < 0.05$) greater than chance are shown with black bars.
  • Figure 4: Sign agreement rates for monolingual and poly-lingual prompting. Vertical lines on bars indicate ±1 s.d. of variation. Bars are paler when their sign agreement is not significantly ($p < 0.05$) greater than chance.
  • Figure 5: News article template without populist framing.
  • ...and 3 more figures
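
The captions for Figures 3 and 4 report sign agreement rates and whether they are significantly greater than chance. The page does not state how the metric or the test is computed, so the following is only a minimal sketch under stated assumptions: sign agreement is taken to be the fraction of effects whose sign in the simulated data matches the sign in the human data, chance is taken to be 0.5, and significance is checked with a one-sided exact binomial test. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
from math import comb


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_agreement(human_effects, simulated_effects):
    """Count effects whose sign matches between human and simulated data.

    Returns (number of agreements, number of compared effects).
    Assumption: each list entry is one effect estimate, in the same order.
    """
    if len(human_effects) != len(simulated_effects):
        raise ValueError("effect lists must have the same length")
    agreements = sum(
        sign(h) == sign(s) for h, s in zip(human_effects, simulated_effects)
    )
    return agreements, len(human_effects)


def binomial_p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical effect estimates, for illustration only.
human = [0.4, -0.2, 0.1, 0.3, -0.5, 0.2, 0.6, -0.1]
simulated = [0.2, -0.1, 0.3, 0.1, -0.2, -0.4, 0.5, -0.3]

k, n = sign_agreement(human, simulated)
rate = k / n
p_value = binomial_p_at_least(k, n)
print(f"agreement rate = {rate:.3f}, one-sided p = {p_value:.4f}")
```

With these made-up numbers, 7 of 8 signs agree, so the rate would clear a $p < 0.05$ threshold against a chance level of 0.5. The paper's own procedure may differ, for example in how chance is defined or how variation across runs is handled.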