Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey

TL;DR

The paper introduces persona vectors: linear directions in an LLM's activation space that correspond to high-level character traits and can be used to monitor and control the Assistant persona. It provides an automated pipeline that extracts these vectors from natural-language trait descriptions, and demonstrates their use for both deployment-time monitoring and training-time steering, including a preventative strategy that limits trait drift. Empirically, activation shifts along persona directions correlate strongly with trait expression after finetuning, and projection-based data screening can flag problematic training data before finetuning. The work also validates the approach on real-world chat data and discusses limitations (evaluation challenges, trait scope, and computational cost), while outlining directions for broader trait coverage and mechanistic understanding.

Abstract

Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.

Paper Structure

This paper contains 89 sections, 6 equations, 35 figures, 10 tables.

Figures (35)

  • Figure 1: Persona vectors and their applications. Top: Our automated pipeline takes as input a personality trait (e.g. "evil") along with a natural-language description. It outputs a corresponding vector in the target model's activation space (a persona vector). Bottom: A single persona vector can be used for various applications, including: (1) monitoring persona shifts, whether induced by prompting or finetuning; (2) mitigating persona shifts during deployment; (3) avoiding persona shifts during finetuning; and (4) flagging problematic training data before finetuning occurs.
  • Figure 2: Automated pipeline for persona vector extraction. Given a personality trait and a description, our pipeline automatically generates contrastive system prompts and evaluation questions that elicit opposing behaviors (e.g., evil vs. non-evil responses). Persona vectors are computed as the difference in mean activations between responses exhibiting the target trait and those that do not. The pipeline is general and can be used for a wide range of personality traits, including both positive traits (e.g., optimism, humor) and other negative traits (e.g., sycophancy, hallucinations).
  • Figure 3: Steering with persona vectors. Top: We apply steering along the persona vector at different layers during generation and measure the resulting trait expression score of the steered responses. Each line represents a different steering coefficient. This figure shows results for Qwen2.5-7B-Instruct; results for Llama-3.1-8B-Instruct are shown in a separate figure (source label: fig:steer_llama). Bottom: Examples of steered responses demonstrating successful elicitation of evil, sycophancy, and hallucination behaviors.
  • Figure 4: Monitoring prompt-induced behavioral shifts. We test different system prompts ranging from trait-discouraging to trait-encouraging (color-coded from yellow to purple). Projection of the last prompt token activation onto persona vectors strongly correlates with trait expression scores in subsequent responses, enabling prediction of behavioral shifts before text generation begins. Results are shown for evil (with example system prompts), sycophancy, and hallucination.
  • Figure 5: Diverse datasets induce varied persona shifts after finetuning. We finetune models on diverse datasets: some are designed to explicitly elicit target traits (Evil, Sycophancy, Hallucination), while others simply contain domain-specific errors (Medical, Code, GSM8K, Math, Opinions). Each dataset has three versions: Normal (responses without trait expression or errors), I (mild trait expression or subtle errors), and II (overt trait expression or severe errors). Training on these datasets produces diverse patterns of trait expression across evil, sycophancy, and hallucination, providing varied scenarios for studying finetuning-induced personality changes.
  • ...and 30 more figures
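
The captions for Figures 2 to 4 describe three operations on activations: taking a difference of means to obtain a persona vector, projecting an activation onto that vector to monitor a trait, and adding a scaled copy of the vector to steer generation. The following is a minimal sketch of those three operations on toy data, not the authors' implementation. The function names, the unit-normalized projection, and the additive steering form `h + coeff * v` are assumptions made for illustration; in practice the activations would come from a chosen layer of the model rather than hand-written lists.

```python
from math import sqrt

Vector = list[float]


def mean(vectors: list[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def persona_vector(trait_acts: list[Vector], baseline_acts: list[Vector]) -> Vector:
    """Difference in mean activations: trait-exhibiting minus non-exhibiting responses."""
    mu_trait = mean(trait_acts)
    mu_base = mean(baseline_acts)
    return [t - b for t, b in zip(mu_trait, mu_base)]


def project(activation: Vector, vector: Vector) -> float:
    """Scalar projection of an activation onto the unit-normalized persona vector.

    Normalizing by the vector's norm is an assumption of this sketch.
    """
    norm = sqrt(sum(x * x for x in vector))
    return sum(a * x for a, x in zip(activation, vector)) / norm


def steer(activation: Vector, vector: Vector, coeff: float) -> Vector:
    """Shift an activation along the persona vector (assumed additive form)."""
    return [a + coeff * x for a, x in zip(activation, vector)]


# Toy 3-dimensional "activations" (hypothetical values, not real model data).
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v = persona_vector(trait_acts, baseline_acts)

h = [1.0, 5.0, 2.0]                      # activation to monitor
before = project(h, v)                   # trait projection before steering
after = project(steer(h, v, 2.0), v)     # projection after steering with coeff 2.0

print(v, before, after)
```

A larger projection indicates an activation that lies further along the trait direction, which is the quantity Figure 4 relates to trait expression scores; a negative steering coefficient in this sketch would push the activation away from the trait instead.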