Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness

Amin Banayeeanzade, Ala N. Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, Sai Praneeth Karimireddy

TL;DR

PsySET presents the first holistic benchmark for psychological steering in LLMs, evaluating both emotion and personality control across prompting, vector injection, and PEFT methods. It combines psychometric-style effectiveness tasks with TrustLLM-based safety, truthfulness, and ethics assessments, and finds that prompting is robust whereas vector injection gives finer control at greater risk. Specifically, prompt-based methods yield strong alignment with relatively stable output quality, vector injection offers adjustable intensity at the cost of stability and fluency, and trustworthiness effects vary with emotion, trait, and method. The framework provides a principled, multi-faceted approach to auditing steering for more transparent socially interactive AI systems and sets the stage for safer deployment and further methodological refinement.

Abstract

The ability to control LLMs' emulated emotional states and personality traits is essential for enabling rich, human-centered interactions in socially interactive settings. We introduce PsySET, a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains. Our study spans four models from different LLM families paired with various steering strategies, including prompting, fine-tuning, and representation engineering. Our results indicate that prompting is consistently effective but limited in intensity control, whereas vector injections achieve finer controllability while slightly reducing output quality. Moreover, we explore the trustworthiness of steered LLMs by assessing safety, truthfulness, fairness, and ethics, highlighting potential side effects and behavioral shifts. Notably, we observe idiosyncratic effects; for instance, even a positive emotion like joy can degrade robustness to adversarial factuality, lower privacy awareness, and increase preferential bias. Meanwhile, anger predictably elevates toxicity yet strengthens leakage resistance. Our framework establishes the first holistic evaluation of emotion and personality steering, offering insights into its interpretability and reliability for socially interactive applications.

Paper Structure

This paper contains 52 sections, 4 equations, 26 figures, 7 tables.

Figures (26)

  • Figure 1: Steering large language models toward specific emotional attitudes or personality attributes.
  • Figure 2: Comparison of evaluation metrics for emotional steering in LLMs. Higher values are more desirable. Metrics marked with $\dagger$ are human-annotated. See the human-study appendix for details of the study design and analysis.
  • Figure 3: The PsySET framework comprises three components: (1) LLM steering methods, (2) psychometric evaluation tasks for assessing effectiveness, and (3) trustworthiness evaluations. A higher-resolution version of this figure appears elsewhere in the paper.
  • Figure 4: The interaction between text quality, open-ended generation success, and QA accuracy as a function of steering strength. Higher values indicate better performance across all axes.
  • Figure 5: Steering extraversion across different approaches, each adjusted to its maximum possible range without text quality loss. Light/dark = steering introversion/extraversion; higher y = stronger extraversion.
  • ...and 21 more figures