
Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

Minda Zhao, Yutong Yang, Chufei Peng, Rachel Gonsalves, Weiyue Li, Ruyi Yang, Zhixi Liu, Mengyu Wang

Abstract

Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear. Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning, and social inference. Across models and tasks, static emotional prefixes usually produce only small changes in accuracy, suggesting that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention. This stability is not uniform: effects are more variable in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning. Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones. We then introduce EmotionRL, an adaptive emotional prompting framework that selects an emotional framing for each query. Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting. Together, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.

Paper Structure

This paper contains 29 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Pipeline overview. We generate emotion-conditioned inputs via canonical prepended cues and context-aligned variants, evaluate them with frozen LLMs across diverse benchmarks, and analyze performance sensitivity across models, tasks, and emotions. We also include adaptive per-instance emotion selection with EmotionRL.
  • Figure 2: EmotionRL pipeline for adaptive emotion selection. a, Offline training. For each training instance $(x_i, y_i)$ from the benchmark datasets, EmotionRL first computes a frozen semantic state $s_i = f_{\mathrm{emb}}(x_i)$. It then enumerates all candidate emotions $a_k \in \mathcal{A}$, constructs the corresponding emotion-conditioned prompts, and queries the frozen backbone LLM under each condition. This produces a grouped reward vector $\mathbf{r}_i = [r_i^{(1)}, \dots, r_i^{(K)}]$, where $r_i^{(k)} = \mathbf{1}[\hat{y}_i^{(k)} = y_i]$. The reward vector is converted into soft supervision $w_i^{(k)}$ and cached as an offline reward dataset $\{(s_i, \mathbf{r}_i)\}_{i=1}^N$. A policy network is then trained to predict $\pi_\theta(a_k \mid s_i)$ by minimizing the reward-weighted cross-entropy objective in Eq. (2). b, Online inference. For an unseen input $x$, EmotionRL computes $s = f_{\mathrm{emb}}(x)$, applies the trained policy $\pi_\theta(a_k \mid s)$, and selects $a^* = \arg\max_{a_k \in \mathcal{A}} \pi_\theta(a_k \mid s)$. The selected emotion $a^*$ is then used to construct a single emotion-conditioned prompt, which is submitted once to the frozen backbone LLM to obtain the final prediction $\hat{y}$.
  • Figure 3: Effect of static emotional prefixes across benchmark tasks. Accuracy change relative to the matched no-emotion prompt for six prepended emotions across six benchmarks and three backbone models. Each bar isolates the effect of emotional framing while keeping the underlying question unchanged. Most deltas remain close to zero, showing that static emotional prompting usually acts as a mild perturbation rather than a strong performance modifier. The largest dispersion appears in socially grounded settings and a small number of harder reasoning conditions, where the same emotion can help one model and hurt another.
  • Figure 4: Effect of emotional intensity on MedQA-US. Accuracy delta relative to the no-emotion baseline as the intensity of a prepended emotion injection statement increases from slight to extreme. All three models remain close to zero across the full range of intensities. Stronger affective wording produces only a mild downward trend and does not induce an abrupt failure regime, indicating that emotional intensity changes the magnitude of the perturbation without qualitatively changing task behavior.
  • Figure 5: MedQA-US: human versus LLM emotion injection. Accuracy of Qwen3-14B on the held-out subset under the no-emotion baseline and six emotion conditions, comparing LLM-generated prefixes with human-written prefixes. The two sources produce closely matched accuracies across conditions, and the small differences do not consistently favor one source over the other. The qualitative effect of emotional framing is therefore robust to how the prefix is authored.
  • ...and 1 more figure
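The EmotionRL procedure described in the Figure 2 caption can be sketched in a few lines of code. The following is a minimal, self-contained illustration, not the paper's implementation: the embedding model and backbone LLM are replaced by toy data, the emotion set is hypothetical, and the policy is a simple linear softmax trained with the reward-weighted cross-entropy objective (Eq. 2, as described in the caption).

```python
# Minimal sketch of EmotionRL offline training and online inference.
# All names (EMOTIONS, dimensions, toy rewards) are illustrative assumptions;
# in the paper, s_i comes from a frozen embedding model and rewards come from
# querying the frozen backbone LLM under each emotion-conditioned prompt.
import numpy as np

EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "neutral"]  # hypothetical set
K, D = len(EMOTIONS), 8  # number of candidate emotions, embedding dimension

def soft_supervision(r):
    """Convert a grouped 0/1 reward vector r_i into soft weights w_i^(k)."""
    r = np.asarray(r, dtype=float)
    return r / r.sum() if r.sum() > 0 else np.full_like(r, 1.0 / len(r))

def train_policy(states, rewards, lr=0.5, epochs=200):
    """Fit a linear softmax policy pi_theta(a_k | s) by gradient descent on
    the reward-weighted cross-entropy  L = -sum_k w^(k) log pi_theta(a_k | s)."""
    theta = np.zeros((D, K))
    for _ in range(epochs):
        logits = states @ theta
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        W = np.array([soft_supervision(r) for r in rewards])
        grad = states.T @ (probs - W) / len(states)       # dL/dtheta for softmax CE
        theta -= lr * grad
    return theta

def select_emotion(theta, s):
    """Online inference: a* = argmax_k pi_theta(a_k | s); the backbone LLM is
    then queried once with the a*-conditioned prompt (not simulated here)."""
    return EMOTIONS[int(np.argmax(s @ theta))]

# Toy offline reward dataset: the sign of the first state coordinate decides
# which emotion would have yielded a correct answer.
rng = np.random.default_rng(0)
states = rng.normal(size=(64, D))
rewards = np.zeros((64, K))
rewards[np.arange(64), (states[:, 0] > 0).astype(int)] = 1.0

theta = train_policy(states, rewards)
s = np.zeros(D); s[0] = 2.0
print(select_emotion(theta, s))
```

The policy network here is deliberately the simplest possible choice; the key structural point from the caption is that training uses cached rewards for *all* candidate emotions per instance, while inference costs only one policy forward pass plus a single LLM call.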