StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?

Guobin Shen; Dongcheng Zhao; Aorigele Bao; Xiang He; Yiting Dong; Yi Zeng

TL;DR

The paper investigates whether stress affects Large Language Models (LLMs) as it affects humans. It introduces StressPrompt, a dataset of 100 psychology-grounded prompts annotated with stress levels, and evaluates LLMs under these prompts across multiple benchmarks. Alongside prompt engineering, it uses a Stress Scanner inspired by Representation Engineering (RepE) to quantify changes in the models' hidden states and relate them to task performance. The results are interpreted through the Yerkes-Dodson law, under which moderate arousal yields the best performance. LLMs generally peak at moderate stress levels on reasoning, instruction following, and emotional intelligence, while high stress degrades some capabilities such as bias detection; deeper layers show stronger neural signatures of stress, especially at the last token. The work offers practical guidance for designing AI systems that maintain high performance in stress-prone real-world environments and contributes a framework for studying human-like cognitive dynamics in artificial agents.
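
To make the layer-wise reading idea concrete, here is a minimal sketch of how one might score a single layer's last-token hidden states against stress ratings. It is not the paper's Stress Scanner: it swaps in a simple difference-of-means direction in place of the RepE procedure, and every name (`stress_direction`, `layer_score`, `threshold`) and every number below is a hypothetical, synthetic stand-in.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def stress_direction(hidden, ratings, threshold):
    """Difference of mean hidden states: high-stress minus low-stress prompts.

    Assumption: this stands in for a RepE-style reading vector.
    """
    high = [h for h, r in zip(hidden, ratings) if r >= threshold]
    low = [h for h, r in zip(hidden, ratings) if r < threshold]
    return [a - b for a, b in zip(mean_vector(high), mean_vector(low))]

def layer_score(hidden, ratings, threshold=5):
    """Correlate each prompt's projection onto the direction with its rating."""
    direction = stress_direction(hidden, ratings, threshold)
    projections = [sum(a * b for a, b in zip(h, direction)) for h in hidden]
    return pearson(projections, ratings)

# Synthetic last-token hidden states for four prompts (2-dimensional for brevity).
ratings = [1, 3, 7, 9]  # hypothetical stress ratings
shallow_layer = [[0.5, 0.1], [-0.4, 0.3], [-0.5, 0.2], [0.4, 0.25]]
deep_layer = [[1.0, 0.2], [3.0, -0.1], [7.0, 0.1], [9.0, -0.2]]

shallow_score = layer_score(shallow_layer, ratings)
deep_score = layer_score(deep_layer, ratings)
print(f"shallow layer: {shallow_score:.3f}, deep layer: {deep_score:.3f}")
```

In this toy data the deep layer encodes the rating almost directly, so its score is close to 1, while the shallow layer's score stays low. With a real model, the hidden states would come from the model's per-layer outputs at the final prompt token, and the direction should be fitted on prompts held out from the correlation step.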

Abstract

Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low- and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.
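
The inverted-U pattern described by the Yerkes-Dodson law can be checked by fitting a quadratic to performance as a function of stress level and locating its peak. The sketch below is one way to do that in plain Python; the 1 to 10 stress scale, the score values, and the helper names (`fit_quadratic`, `peak_stress`) are illustrative assumptions, not data or code from the paper.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    lhs = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]
    d = det3(lhs)
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)  # (a, b, c)

def peak_stress(xs, ys):
    """Stress level at the vertex of the fitted parabola, or None if not inverted-U."""
    a, b, _ = fit_quadratic(xs, ys)
    if a >= 0:
        return None  # curve opens upward (or is flat): no interior maximum
    return -b / (2 * a)

# Hypothetical benchmark scores at assumed stress levels 1..10.
stress_levels = list(range(1, 11))
scores = [0.54, 0.61, 0.66, 0.69, 0.70, 0.69, 0.66, 0.61, 0.54, 0.45]

peak = peak_stress(stress_levels, scores)
print("estimated optimal stress level:", peak)
```

A negative leading coefficient `a` is what signals the inverted U; the vertex `-b / (2a)` then gives the stress level at which the fitted performance is highest.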

Paper Structure (13 sections, 4 equations, 10 figures, 1 table)

Introduction
Related Works
Method
  StressPrompt Construction
  StressPrompt Evaluation
  StressPrompt Analysis
Experiments
  Experimental Setup
  Analysis Under Varying Stress Levels
  Impact of Stress on Emotional Intelligence, Bias, and Hallucination
  Visualization of the Effect of Stress on Neural Activity
Conclusion
Acknowledgments

Figures (10)

Figure 1: Comparison of stress-level performance between LLMs and humans.
Figure 2: StressPrompt acts as a system instruction, simulating different environments and influencing the LLM's response. Left: Low stress level. Right: Moderately high stress level.
Figure 3: Design of StressPrompt based on psychological principles. Each category encompasses a range of stress-inducing scenarios, ensuring a comprehensive set of prompts for our study.
Figure 4: Distribution of participant scores on stress levels in StressPrompt. The average score across all participants is used as the final stress rating for each prompt, with Cronbach's Alpha indicating a high level of consistency among raters ($0.9947$, $p < 0.001$).
Figure 5: Stress scanner constructed with RepE on Meta-Llama-3-8B-Instruct. Various StressPrompts induce differences in the neural activity of LLMs, with the last token showing the most significant correlation with stress.
...and 5 more figures
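
The rating procedure in the Figure 4 caption (averaging participant scores into a final stress rating per prompt and checking rater consistency with Cronbach's Alpha) can be sketched as follows. The score matrix is made up for illustration, and treating each rater as an "item" in the standard alpha formula is an assumption about how the statistic was applied; the function names are hypothetical.

```python
from statistics import mean, variance

def final_ratings(scores):
    """Average score across raters for each prompt (one row per prompt)."""
    return [mean(row) for row in scores]

def cronbach_alpha(scores):
    """Cronbach's alpha with raters as items.

    alpha = k / (k - 1) * (1 - sum of per-rater variances / variance of row totals)
    """
    k = len(scores[0])  # number of raters
    rater_variances = [variance(col) for col in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(rater_variances) / total_variance)

# Hypothetical scores: 4 prompts (rows) rated by 3 participants (columns).
scores = [
    [2, 3, 2],
    [5, 4, 5],
    [8, 9, 8],
    [6, 6, 7],
]

ratings = final_ratings(scores)
alpha = cronbach_alpha(scores)
print("final ratings:", ratings)
print("Cronbach's alpha:", round(alpha, 4))
```

Values of alpha near 1 indicate that raters order the prompts consistently, which is the property the caption reports for the StressPrompt ratings.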
