CultureLLM: Incorporating Cultural Differences into Large Language Models

Cheng Li; Mengzhou Chen; Jindong Wang; Sunayana Sitaram; Xing Xie

TL;DR

CultureLLM tackles cultural bias in LLMs by leveraging 50 World Values Survey seed samples and a novel semantic data augmentation pipeline to generate semantically equivalent training data. This enables fine-tuning of culture-specific models and a unified CultureLLM-One across 9 cultures, achieving strong gains over GPT-3.5 and Gemini Pro, with results comparable to or approaching GPT-4 across 60 culture-related datasets. A human study confirms semantic equivalence of augmented data, and supplementary analyses show robustness to forgetting and compatibility with open-source LLMs like Llama-2. The approach offers a cost-effective path to culturally aware LLMs, particularly for low-resource cultures, while acknowledging limitations and outlining societal implications.

Abstract

Large language models (LLMs) are reported to be biased toward certain cultures because English corpora dominate their training data. Since multilingual cultural data are often expensive to collect, existing efforts address this through prompt engineering or culture-specific pre-training. However, these approaches may overlook the knowledge deficiency of low-resource cultures and require extensive computing resources. In this paper, we propose CultureLLM, a cost-effective solution to incorporate cultural differences into LLMs. CultureLLM adopts the World Values Survey (WVS) as seed data and generates semantically equivalent training data via the proposed semantic data augmentation. Using only 50 seed samples from WVS together with the augmented data, we fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9 cultures covering both high-resource and low-resource languages. Extensive experiments on 60 culture-related datasets demonstrate that CultureLLM significantly outperforms various counterparts such as GPT-3.5 (by 8.1%) and Gemini Pro (by 9.5%), with performance comparable to or even better than GPT-4. Our human study shows that the generated samples are semantically equivalent to the original samples, providing an effective solution for LLM augmentation. Code is released at https://github.com/Scarelette/CultureLLM.
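The generate-then-filter idea behind semantic data augmentation can be illustrated with a toy sketch. It is not the released implementation: the synonym table, the seed question, the `augment` function, and the threshold are all hypothetical, and a bag-of-words cosine similarity stands in for whatever semantic similarity measure the authors actually use.

```python
import math
import re
from collections import Counter
from itertools import product

# Hypothetical synonym table. The paper describes context-aware synonym
# replacement; a fixed lookup is used here only to keep the sketch small.
SYNONYMS = {
    "agree": ["concur"],
    "important": ["essential"],
}


def tokens(text):
    """Lowercase word tokens of a sentence."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Bag-of-words cosine similarity (a stand-in for an embedding model)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def augment(seed, threshold=0.8):
    """Generate synonym-replaced variants and keep those similar to the seed."""
    # Each position offers the original word plus any listed synonyms.
    options = [[w] + SYNONYMS.get(w.lower(), []) for w in seed.split()]
    candidates = {" ".join(choice) for choice in product(*options)}
    candidates.discard(seed)  # the unchanged seed is not a new sample
    # Semantic filtering: drop candidates that drift too far from the seed.
    return sorted(c for c in candidates if cosine(seed, c) >= threshold)


# Illustrative seed question (not an actual WVS item).
seed = "Do you agree that work is important"
print(augment(seed, threshold=0.8))
```

With the threshold at 0.8, only the single-replacement variants pass the filter; lowering it to 0.7 also admits the variant with both words replaced. The threshold is therefore the knob that trades diversity against fidelity to the seed.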

Paper Structure (55 sections, 2 equations, 8 figures, 12 tables)

Introduction
Related Work
Cultural Problem and Solution in LLMs
Data Augmentation for LLMs
CultureLLM
Overview
Sampling
Semantic Data Augmentation
Semantic Template Generation
Intact Sample Generation
Fine-tuning
Experiments
Setup
Main Results
Results on Open-ended Generation Tasks
...and 40 more sections

Figures (8)

Figure 1: Overview of CultureLLM. CultureLLM consists of three steps: sampling, semantic data augmentation, and fine-tuning. Culture-specific and unified CultureLLM can be fine-tuned.
Figure 2: Details of semantic data augmentation. First, semantic templates are generated via rephrasing, semantic filtering, and sentence parsing. Then, training samples are generated by context-aware synonym replacement and semantic filtering.
Figure 3: (a) The main results averaged by cultures (left) and by tasks (right). Both CultureLLM and CultureLLM-One significantly outperform GPT-3.5 and Gemini, with CultureLLM achieving the best performance, comparable to GPT-4. (b) Ablation study. '+WVS' denotes the models fine-tuned using only the 50 samples from WVS, '+WVS+a' denotes fine-tuning using the WVS samples and the samples generated in step 1 of our data augmentation (i.e., using only GPT-4 to generate), and '+WVS+a+b' denotes the complete process of our algorithm.
Figure 4: (a) Results for different numbers of fine-tuning samples, with perplexity score and diversity gain shown above. (b) Results of fine-tuning on English (En ft) and on local languages (local ft). Fine-tuning on English outperforms fine-tuning on local languages.
Figure 5: (a) Analysis on catastrophic forgetting on BBH and GSM8K. The red line denotes the results of GPT-3.5. For BBH, we show the average results of $21$ tasks in this figure. The x-axis represents models and the y-axis represents performance. (b) CultureLLM-Llama-70b averaged by cultures (left) and by tasks (right), which outperforms the vanilla Llama model by $2.17\%$ on average.
...and 3 more figures
