DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers

Max Müller-Eberstein, Mike Zhang, Elisa Bassignana, Peter Brunsgaard Trolle, Rob van der Goot

TL;DR

This work addresses cultural misalignment in LLMs for Danish, a mid-resource language, by introducing DaKultur, the first native Danish cultural evaluation dataset. It compares three Danish-adapted LLMs and shows that training on native Danish data more than doubles human-judged cultural acceptance (from $14\%$ to $42\%$), while automatic benchmarks lag behind culturally aware performance. The study demonstrates that translated data alone are insufficient to confer deep cultural knowledge and highlights topic- and demographic-specific variation in acceptance. The dataset and findings provide a practical pathway for aligning Danish LLMs to cultural expectations and offer a methodological blueprint for similar evaluation in other mid-resource languages.

Abstract

Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: in particular, currently employed automatically translated data are insufficient to train or measure cultural adaptation, whereas training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur, the first native Danish cultural awareness dataset.

Paper Structure

This paper contains 22 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Demographic Statistics for Our 63 Study Participants, who were asked to optionally provide the region where they grew up, their age range in decades, and their gender identity. 94% of respondents opted to provide this information.
  • Figure 2: Acceptance/Rejection Rates across SnakModel, Llama2-7Bchat+INSTda and Llama2-7Bbase+INSTda as judged by participants in DaKultur. Left: overall results; Right: results by topic.
  • Figure 3: Study Interface for Human Cultural Evaluation. Participants are guided through the guidelines and an optional demographic registration before being asked to prompt the three LLMs simultaneously and to evaluate the model responses. Translations of the guidelines, interface, and examples can be found in the appendix.
  • Figure 4: Acceptance/Rejection Rates and Distribution across Topics for the female/male gender identity demographic groups.
  • Figure 5: Acceptance/Rejection Rates and Distribution across Topics for the age ranges $\geq$ 29 and $\leq$ 30.
  • ...and 1 more figure