Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality
Takumi Ohashi, Hitoshi Iyatomi
TL;DR
The paper addresses the lack of sentence-level evaluation of cultural specificity in multilingual LLMs by introducing the Conceptual Cultural Index (CCI). CCI quantifies cultural specificity as the difference between a sentence's generality in the target culture $t$ and its average generality across the other cultures in the comparison set $\mathcal{C}$: $CCI(x;t,\mathcal{C}) = \bar{p}_t(x) - \frac{1}{|\mathcal{C}|-1} \sum_{c\in\mathcal{C}\setminus\{t\}} \bar{p}_c(x)$, so that $CCI\in[-1,1]$. Generality scores $p_c(x)\in[0,1]$ are obtained from an LLM for each culture in $\mathcal{C}$ and averaged over $N$ runs: $\bar{p}_c(x) = \frac{1}{N} \sum_{n=1}^{N} f_{LLM}^{(n)}(x;\mathcal{C})[c]$. The framework is validated with Japanese as the target culture on 400 sentences (200 culture-specific, 200 general) across five LLMs. It shows better separability than direct scoring, and the cultural scope can be controlled through Global vs. Custom modes and neighboring-culture configurations. CCI-based stratification of JCQA and JCM further shows that higher cultural specificity is generally associated with lower task accuracy, which has practical implications for culture-aware benchmarking and data curation. The work provides a practical, interpretable pipeline for evaluating and controlling cultural specificity in multilingual contexts.
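The two formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `mean_generality` and `cci`, the culture codes, and the scores are all hypothetical, and the LLM call is replaced by hard-coded per-run outputs.

```python
from statistics import mean


def mean_generality(runs):
    """Average per-culture generality scores over N runs.

    `runs` is a list of dicts, one per LLM run, each mapping a culture
    code to a generality score in [0, 1] (hypothetical input format).
    """
    cultures = runs[0].keys()
    return {c: mean(run[c] for run in runs) for c in cultures}


def cci(runs, target):
    """CCI = mean generality in the target culture
    minus the average mean generality over all other cultures."""
    p_bar = mean_generality(runs)
    others = [c for c in p_bar if c != target]
    if not others:
        raise ValueError("need at least one non-target culture")
    return p_bar[target] - mean(p_bar[c] for c in others)


# Made-up scores for one sentence over N = 2 runs and three cultures.
runs = [
    {"JP": 0.9, "US": 0.2, "FR": 0.1},
    {"JP": 0.7, "US": 0.4, "FR": 0.3},
]
score = cci(runs, target="JP")
print(f"CCI = {score:.2f}")
```

For these toy values the averaged generalities are 0.8 (JP), 0.3 (US), and 0.2 (FR), so the index works out to 0.8 - 0.25 = 0.55. A sentence that is equally general in every culture would score 0.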
Abstract
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI .
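As a rough illustration of the binary separability measure mentioned above, AUC can be computed as the probability that a randomly chosen culture-specific sentence receives a higher score than a randomly chosen general one. The sketch below assumes this standard pairwise definition; the function name `roc_auc` and the scores are invented for illustration and are not results from the paper.

```python
def roc_auc(pos_scores, neg_scores):
    """Pairwise AUC: fraction of (positive, negative) pairs in which the
    positive item scores higher, with ties counted as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up CCI values: culture-specific sentences vs. general sentences.
specific = [0.6, 0.4, 0.1]
general = [0.0, 0.2, -0.1]
auc = roc_auc(specific, general)
print(f"AUC = {auc:.3f}")
```

Here 8 of the 9 pairs are ordered correctly, so the AUC is 8/9. A value of 1.0 would mean perfect separation and 0.5 would mean the score carries no ranking information.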
