CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs

Truong Vo, Sanmi Koyejo

TL;DR

This work introduces CURE, a "thick" cultural evaluation framework for LLMs that situates normative reasoning within realistic, persona-driven scenarios spanning 145 countries and 30,000+ items. By pairing thin accuracy with four diagnostic, free-form reasoning metrics (Coverage, Specificity, Connotation, Coherence) and validating an automated judge against human ratings, the study demonstrates that traditional surface-level benchmarks overestimate cultural competence and fail to reveal underlying reasoning gaps. The four metrics capture orthogonal dimensions of cultural understanding, expose model weaknesses that vary by dataset and prompting, and show that larger models do not uniformly solve all reasoning challenges. The results advocate for evaluating cultural alignment through reasoning proficiencies rather than raw accuracy, with practical implications for model development, data curation, and deployment in cross-cultural contexts.

Abstract

Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. Current methods focus on de-contextualized correctness or forced-choice judgments, overlooking the cultural understanding and reasoning that appropriate responses require. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of a model's response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.

Paper Structure

This paper contains 13 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: SpecNorm construction pipeline from data sourcing to human-reviewed scenario generation.
  • Figure 2: F$_1$-Micro: Thin (blue) vs. thick (red) evaluation.
  • Figure 3: F$_1$-Macro: Thin (blue) vs. thick (red) evaluation.
  • Figure 4: Reasoning quality metrics (Coverage, Specificity, Connotation, Coherence) for each model.