How Well Do LLMs Identify Cultural Unity in Diversity?

Jialin Li, Junli Wang, Junjie Hu, Ming Jiang

TL;DR

The paper introduces CUNIT, a benchmark to assess decoder-only LLMs on culture-centered concept similarity across 10 countries using a contrastive matching task. It builds a data-rich pipeline with 1,425 triplets drawn from clothing and food concepts, annotated with 164 pragmatic features, and analyzed under multiple prompting strategies. Findings show humans outperform LLMs, with GPT-3.5 generally ahead of LLaMA, while prompting strategy (notably chain-of-thought) and data factors like concept frequency influence performance; geo-cultural proximity has a limited effect. The work highlights the challenges of capturing cross-cultural unity in language models and offers a data and evaluation framework to guide future improvements and downstream cross-cultural applications.

Abstract

Much work on the cultural awareness of large language models (LLMs) focuses on the models' sensitivity to geo-cultural diversity. However, in addition to cross-cultural differences, there also exists common ground across cultures. For instance, a bridal veil in the United States plays a culturally relevant role similar to that of a honggaitou in China. In this study, we introduce a benchmark dataset, CUNIT, for evaluating decoder-only LLMs in understanding the cultural unity of concepts. Specifically, CUNIT consists of 1,425 evaluation examples built upon 285 traditional culture-specific concepts across 10 countries. Based on a systematic manual annotation of culturally relevant features per concept, we calculate the cultural association between any pair of cross-cultural concepts. Built upon this dataset, we design a contrastive matching task to evaluate the LLMs' capability to identify highly associated cross-cultural concept pairs. We evaluate 3 strong LLMs on CUNIT, using 3 popular prompting strategies, under two settings: giving all extracted concept features or giving no features at all. Interestingly, we find that cultural associations across countries regarding clothing concepts largely differ from those regarding food. Our analysis shows that LLMs are still limited in capturing cross-cultural associations between concepts compared to humans. Moreover, geo-cultural proximity shows a weak influence on model performance in capturing cross-cultural associations.

Paper Structure (37 sections, 1 equation, 8 figures, 10 tables)

This paper contains 37 sections, 1 equation, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Illustrative example of a CUNIT data instance for contrastive matching. Given a culturally specific concept, "Bridal veils", from a query culture (e.g., United States) and two culturally specific concept candidates, "Cheongsam" and "Honggaitou", from the target culture (e.g., China), our goal is to ask an LLM to determine which target concept shares a higher cultural-centered similarity to the query concept. This comparison is based on the concepts' pragmatic features in three categories: users (e.g., bride), cultural-specific occasions (e.g., wedding), and cultural significance (e.g., good luck).
  • Figure 2: The pipeline of CUNIT construction. (1) Find relevant descriptions of cultural-specific concepts from Wikipedia. (2) Extract cultural-relevant features of each concept, and map them into a unified feature schema towards manual annotation (e.g., using 'bride' to represent synonym features like 'Chinese brides', 'brides'). (3) Calculate the cultural similarity between any pair of cross-cultural concepts. (4) Construct testing cases by different prompt strategies for LLM evaluation.
  • Figure 3: Average similarity of concept pairs between different countries. We calculate the similarity of different concepts between countries.
  • Figure 4: Consistency of different models in experiments with features. The comparison covers the three prompt strategies under gpt-3.5-turbo-0613, llama-2-7b-chat, and llama-2-13b-chat. gpt-3.5-turbo-0613 is the most consistent, significantly more so than llama-2-7b-chat and llama-2-13b-chat.
  • Figure 5: The accuracy of GPT on concept triplets (with features) in different long-tail degree groups. We compute the maximum long-tail degree in each triplet, denoted $d_{max}$, and report the average accuracy for triplets whose $d_{max}$ falls within the first 1/3 and the last 1/3 of all triplets.
  • ...and 3 more figures
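
The grouping described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: triplets are ranked in ascending order of $d_{max}$, each record carries the long-tail degrees of its three concepts plus a correctness flag, and the sample values are invented. The function name `accuracy_by_dmax_third` is hypothetical, and how the paper defines long-tail degree is not given in this summary.

```python
def accuracy_by_dmax_third(records):
    """Average accuracy in the first and last third of triplets ranked by d_max.

    records: list of (degrees, correct) pairs, where `degrees` holds the
    long-tail degrees of the three concepts in a triplet and `correct`
    says whether the model answered that triplet correctly.
    Assumption: ascending sort by d_max; the paper's ordering may differ.
    """
    ranked = sorted(records, key=lambda r: max(r[0]))  # d_max = max degree
    k = len(ranked) // 3
    if k == 0:
        raise ValueError("need at least 3 triplets")

    def accuracy(group):
        return sum(1 for _, correct in group if correct) / len(group)

    return accuracy(ranked[:k]), accuracy(ranked[-k:])


# Invented sample data, for illustration only.
records = [
    ((0.1, 0.2, 0.3), True),
    ((0.2, 0.1, 0.4), True),
    ((0.5, 0.6, 0.2), False),
    ((0.7, 0.3, 0.1), True),
    ((0.9, 0.8, 0.2), False),
    ((0.95, 0.1, 0.1), False),
]

low, high = accuracy_by_dmax_third(records)
print(low, high)
```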