Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey
TL;DR
The paper introduces persona vectors: linear directions in LLM activation space that represent high-level character traits, which can be used to monitor and control the Assistant persona. It provides an automated pipeline that extracts these vectors from natural-language trait descriptions and demonstrates their use for both deployment-time monitoring and training-time steering, including a preventative strategy that limits trait drift. Empirical results show strong correlations between activation shifts along persona directions and trait expression after finetuning, and show that projection-based data screening can flag problematic training data before it is used. The work also validates the approach on real-world chat data and discusses limitations, such as evaluation challenges, trait scope, and computational costs, while outlining directions for broader trait coverage and mechanistic understanding.
Abstract
Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
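To make the idea of a trait as a linear direction concrete, here is a minimal Python sketch of monitoring by projection and control by steering. The abstract does not say how the vectors are computed, so the sketch assumes a difference of mean activations between trait-exhibiting and baseline responses. The function names, the toy 3-dimensional activations, and the steering coefficient `alpha` are all hypothetical and stand in for hidden states read from a real model.

```python
from math import sqrt

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(trait_acts, baseline_acts):
    """ASSUMPTION: direction = normalized difference of mean activations
    between responses that show the trait and responses that do not."""
    mu_pos = mean_vector(trait_acts)
    mu_neg = mean_vector(baseline_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def projection(activation, direction):
    """Monitoring signal: scalar projection of an activation onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))

def steer(activation, direction, alpha):
    """Control: shift an activation along the direction by alpha
    (negative alpha pushes away from the trait)."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Toy activations (hypothetical values, not taken from any model).
trait_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
baseline_acts = [[0.1, 0.0, 0.1], [-0.1, 0.0, 0.1]]

v = persona_vector(trait_acts, baseline_acts)
sample = [1.5, 0.3, -0.2]
before = projection(sample, v)
after = projection(steer(sample, v, alpha=-2.0), v)
print(f"projection before steering: {before:.3f}, after: {after:.3f}")
```

Because the direction is normalized to unit length, steering by `alpha` changes the projection by exactly `alpha`, so one scalar serves as both the monitoring signal and the unit of intervention. The same projection could in principle be averaged over a dataset or computed per sample to screen training data, though the thresholds for doing so are not given here.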
