
Extrinsic Evaluation of Cultural Competence in Large Language Models

Shaily Bhatt, Fernando Diaz

TL;DR

This work quantitatively and qualitatively evaluates model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts, and finds weak correlations between the text similarity of outputs for different countries and the cultural values of these countries.

Abstract

Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior works have evaluated models' knowledge of cultural norms, values, and artifacts, without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary when varying nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.

Paper Structure (48 sections, 2 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Outputs for Question Answering and Story Generation vary when an explicit cue of culture, i.e. nationality, in the prompt is perturbed. We collect outputs for 345 topics for QA, 35 topics for stories, 193 nationalities, 6 LLMs, 5 responses per prompt, and 2 temperatures. We then evaluate these outputs for the extent of lexical variance (§ 5.1), culturally relevant vocabulary (§ 5.3), and correlation between text distribution and the cultural values (§ 5.2).
  • Figure 2: Lexical Variance in outputs. The variance of outputs across nationalities is consistently higher than the variance of outputs within nationalities. Story generation has a higher median variance than QA across models.
  • Figure 3: Kendall's $\tau_c$ rank correlation between text distribution and cultural closeness of countries. For both plots, text similarity is measured using BLEU. In the HCD plot, the correlation values are greater than 0, indicating a small but positive correlation. In the WVS plot, however, most correlations are less than 0, indicating a small negative correlation. There are no clear trends among different models or tasks.
  • Figure 4: Kendall's $\tau_c$ rank correlation between cultural closeness and text outputs of story generation for GPT 3.5. For both plots, text similarity is measured using BLEU. There is a mix of positive (green) and negative (red) correlations. Russia, China, and Australia have positive correlations, while India, USA, and Canada have negative correlations. European, South American, and African countries are split between positive and negative correlations.
  • Figure 5: Lexical Variance in outputs with temperature = 0.7. The variance of outputs across nationalities is consistently higher than the variance of outputs within nationalities, as also observed with a temperature of 0.3 in § 5.1. Story generation has a higher median variance than QA across models. Note that the absolute values of the variances are higher across the board than those obtained with temperature = 0.3, which is consistent with the expectation that variation in generation increases with temperature.
  • ...and 1 more figures
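
The rank correlation named in the captions for Figures 3 and 4 can be illustrated with a short sketch. This is not the authors' implementation: it is a minimal pure-Python version of Kendall's $\tau_c$ (Stuart's tau-c), and the function name `kendall_tau_c` and all numeric values are hypothetical, chosen only to show how text-similarity scores and cultural-closeness scores for a set of countries could be compared.

```python
from itertools import combinations


def kendall_tau_c(x, y):
    """Stuart's tau-c between two equal-length sequences of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product says whether the pair is ordered
        # the same way in both variables; ties contribute nothing.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    # m is the smaller number of distinct values among the two variables.
    m = min(len(set(x)), len(set(y)))
    if m < 2:
        return float("nan")  # undefined when one variable is constant
    return 2.0 * (concordant - discordant) / (n * n * (m - 1) / m)


# Hypothetical scores for one anchor country compared with four others.
text_similarity = [0.42, 0.31, 0.35, 0.18]     # assumed: BLEU between outputs
cultural_closeness = [0.90, 0.40, 0.70, 0.20]  # assumed: higher = closer values
tau = kendall_tau_c(text_similarity, cultural_closeness)
```

A value near 1 would mean that countries whose outputs are more similar are also ranked as culturally closer, a value near -1 the reverse, and a value near 0 little monotonic relationship, which is the regime the captions describe.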