Analyzing Cultural Representations of Emotions in LLMs through Mixed Emotion Survey

Shiran Dudy; Ibrahim Said Ahmad; Ryoko Kitajima; Agata Lapedriza

TL;DR

This study investigates how Large Language Models represent emotions across cultures by replicating Miyamoto et al.'s mixed-emotion survey in multiple languages and contexts. Using five LLMs (three open-source, two private) and three study designs, it tests English vs Japanese prompts, contextual language cues, and cross-language comparisons among East Asian and Western languages. The findings indicate only limited alignment with human data, with the written language having a stronger influence than explicit contextual cues about speaker origin, and East Asian languages showing more cross-language similarity than Western ones. The work highlights methodological avenues for assessing cultural alignment in LLMs and underscores the need for careful interpretation when using LLMs to model cross-cultural emotions, suggesting directions for more nuanced bias-aware evaluations and data-driven improvements in multilingual cultural representation.

Abstract

Large Language Models (LLMs) have gained widespread global adoption, showcasing advanced linguistic capabilities across multiple languages. There is growing interest in academia in using these models to simulate and study human behaviors. However, it is crucial to acknowledge that an LLM's proficiency in a specific language might not fully encapsulate the norms and values associated with its culture. Concerns have emerged regarding potential biases towards Anglo-centric cultures and values due to the predominance of Western and US-based training data. This study focuses on analyzing the cultural representations of emotions in LLMs, in the specific case of mixed-emotion situations. Our methodology is based on the studies of Miyamoto et al. (2010), which identified distinctive emotional indicators in Japanese and American human responses. First, we administer their mixed emotion survey to five different LLMs and analyze their outputs. Second, we experiment with contextual variables to explore variations in responses considering both language and speaker origin. Third, we expand our investigation to additional East Asian and Western European languages to gauge their alignment with their respective cultures, anticipating a closer fit. We find that (1) models have limited alignment with the evidence in the literature; (2) written language has a greater effect on LLMs' responses than information on participants' origin; and (3) LLMs' responses were more similar for East Asian languages than for Western European languages.

Paper Structure (14 sections, 2 figures, 5 tables)

Introduction
Related Work
Evaluating emotional skills of Large Language Models
Cross-cultural emotion studies
Cultural representations in LLMs
Methods
The Mixed Emotion Experiment with Human Participants
Running the Mixed Emotion Survey on LLMs
Evaluations
Experiments
Study 1: English vs. Japanese
Study 2: English vs. Japanese using context prompts
Study 3: Comparing East Asian vs. Western Languages
Conclusions

Figures (2)

Figure 1: Emotion responses by LLM. This figure presents two emotions ('motivation to change' and 'me responsible for others') under two situations (self-success and self-failure) across five LLMs. First, row by row, the LLMs' underlying distributions differ: gemma, in the first row, clearly separates Japanese and Americans, whereas in mistral the two are mixed. These differences may indicate that the LLMs do not share the same underlying mechanism. In addition, the relative differences between the two populations suggest that gemma, llama, and gpt3.5 may not be simply translating prompts from English.
Figure 2: Experiments to determine the number of runs ($n$) needed to ensure stability of the LLMs' output across the different models. We define the output as stable when, drawing $n$ responses twice (or more), a paired t-test does not show a low p-value. This $n$ was searched separately for Japanese and English responses, and for each of the LLMs we experimented with. Empirically, we searched for the $n$ whose median p-value (shown in each boxplot) is above $0.5$. The figure shows that Japanese, which was less stable than English, required $n=80$ to achieve stability. Therefore, $n=80$ was fixed throughout all our experiments.
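
The stability search described in the Figure 2 caption could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: `sample_response`, `repeats`, and the candidate values of $n$ are hypothetical names and settings, the two draws are assumed to be paired by draw index, and the p-value uses a normal approximation in place of the t distribution so that the sketch needs only the standard library.

```python
import math
import random
import statistics


def paired_test_p_value(a, b):
    """Two-sided p-value for a paired comparison of two equal-length samples.

    Assumption: a normal approximation replaces the t distribution, which is
    reasonable only for large n; an exact paired t-test would use t quantiles.
    """
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        # All differences are equal: no difference at all, or a constant shift.
        return 1.0 if mean_diff == 0 else 0.0
    t_stat = mean_diff / (sd / math.sqrt(len(diffs)))
    return 2 * (1 - statistics.NormalDist().cdf(abs(t_stat)))


def median_p_value(sample_response, n, repeats=30):
    """Draw two batches of n responses `repeats` times; return the median p-value."""
    p_values = []
    for _ in range(repeats):
        first = [sample_response() for _ in range(n)]
        second = [sample_response() for _ in range(n)]
        p_values.append(paired_test_p_value(first, second))
    return statistics.median(p_values)


def smallest_stable_n(sample_response, candidates, threshold=0.5):
    """Return the first candidate n whose median p-value exceeds the threshold."""
    for n in candidates:
        if median_p_value(sample_response, n) > threshold:
            return n
    return None  # no candidate met the criterion


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stand-in for one LLM survey answer on a 1-5 rating scale.
    fake_model = lambda: rng.randint(1, 5)
    print(smallest_stable_n(fake_model, candidates=[10, 20, 40, 80]))
```

In practice `sample_response` would wrap one call to a model for a given survey item and language, and the search would be repeated per model and per language, as the caption describes.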
