Exploring the Personality Traits of LLMs through Latent Features Steering

Shu Yang; Shenzhe Zhu; Liang Liu; Lijie Hu; Mengdi Li; Di Wang

TL;DR

The paper investigates how LLMs acquire and express personality, using a social-determinism framework that separates long-term background factors (encoded in model parameters) from short-term pressures (prompts and context). It introduces a training-free latent-feature steering approach: long-term factors are extracted with Sparse Autoencoders (SAEs) applied to activations, while short-term pressures are captured as representation-based directions. Background features are steered by modifying the residual stream, $\mathbf{R}^l_{:,:t-1,:} \leftarrow \mathbf{R}^l_{:,:t-1,:} + c f_b^m$, and pressure features by modifying the final-token activation, $h_l(t-1) \leftarrow h_l(t-1) + c f_p^n$. The framework is evaluated with TRAIT-based personality assessments and SafetyBench across Gemma models. The evaluation shows that background factors can alter safety and bias, and that model size modulates trait stability and sensitivity to prompts. The results point to a feasible, training-free path to controllable LLM personality that carries important safety trade-offs, with implications for personalized yet responsibly aligned AI systems.
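To make the two update rules concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the residual stream at layer $l$ is an array of shape (batch, sequence, hidden), that $t$ is the sequence length with 0-indexed positions, and that $c$ is a scalar steering coefficient. The function names `steer_background` and `steer_pressure` are hypothetical, and the feature vectors are placeholders rather than real SAE or representation directions.

```python
import numpy as np


def steer_background(R, f_b, c):
    """Add c * f_b to every position except the last one.

    Mirrors R[:, :t-1, :] <- R[:, :t-1, :] + c * f_b under the
    assumption that t is the sequence length (0-indexed positions).
    R has shape (batch, seq, d_model); f_b has shape (d_model,).
    """
    steered = R.copy()              # leave the caller's array untouched
    steered[:, :-1, :] += c * f_b   # broadcast over batch and positions
    return steered


def steer_pressure(R, f_p, c):
    """Add c * f_p to the final-token activation only.

    Mirrors h_l(t-1) <- h_l(t-1) + c * f_p under the same assumption.
    """
    steered = R.copy()
    steered[:, -1, :] += c * f_p
    return steered


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R = rng.normal(size=(2, 5, 8))   # toy residual stream
    f_b = rng.normal(size=8)         # placeholder background feature
    f_p = rng.normal(size=8)         # placeholder pressure direction
    out = steer_pressure(steer_background(R, f_b, c=4.0), f_p, c=2.0)
    print(out.shape)
```

In a real model these additions would typically be applied inside a forward hook at the chosen layer; how the coefficient and layer are selected is not specified in this summary.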

Abstract

Large language models (LLMs) have significantly advanced dialogue systems and role-playing agents through their ability to generate human-like text. While prior studies have shown that LLMs can exhibit distinct and consistent personalities, the mechanisms through which these models encode and express specific personality traits remain poorly understood. To address this, we investigate how various factors, such as cultural norms and environmental stressors, encoded within LLMs, shape their personality traits, guided by the theoretical framework of social determinism. Inspired by related work on LLM interpretability, we propose a training-free approach to modify the model's behavior by extracting and steering latent features corresponding to factors within the model, thereby eliminating the need for retraining. Furthermore, we analyze the implications of these factors for model safety, focusing on their impact through the lens of personality.
