CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models

Yuhang Wang, Yanxu Zhu, Chao Kong, Shuyu Wei, Xiaoyuan Yi, Xing Xie, Jitao Sang

TL;DR

CDEval introduces a culture-focused benchmark that complements universal-value alignment in LLMs by measuring six Hofstede cultural dimensions across seven domains. Built from GPT-4-generated items with human verification, it yields 2,953 questionnaire samples and a framework for multi-model evaluation across 17 respondents. The results reveal distinct cultural patterns, domain-specific adaptations, and some cultural consistency within model families, while highlighting Western-leaning tendencies attributed to English-dominated training corpora. This work underscores the importance of integrating cultural dimensions into LLM development to support more culturally aware and sensitive AI systems across languages and domains.

Abstract

As the scaling of Large Language Models (LLMs) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains. Our comprehensive experiments provide intriguing insights into the culture of mainstream LLMs, highlighting both consistencies and variations across different dimensions and domains. The findings underscore the importance of integrating cultural considerations in LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.

Paper Structure (22 sections, 7 equations, 8 figures, 11 tables, 1 algorithm)

Figures (8)

  • Figure 1: Top: an example to illustrate different cultural orientations of people. Bottom: the likelihood of cultural orientations of mainstream LLMs in three dimensions measured using CDEval. For instance, among the models evaluated, GPT-4 exhibits the lowest Power Distance Index (PDI), whereas Baichuan2 stands out with the highest PDI.
  • Figure 2: The pipeline of benchmark construction for LLMs' cultural dimensions measurement.
  • Figure 3: The measurement results of mainstream LLMs across six cultural dimensions.
  • Figure 4: Left: the average likelihood of GPT-3.5 in English, German and Chinese. Right: the similarities between GPT-3.5 results in different languages and human society results.
  • Figure 5: Left: the results of different model generations. Right: the results of models fine-tuned with different language corpora.
  • ...and 3 more figures