Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs

Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem, Mehwish Nasim

TL;DR

The paper tackles the challenge of reliably controlling implicit LLM personalities by introducing Activation-Space Personality Steering, which uses Big Five trait directions extracted from activations to steer outputs. It couples a low-rank subspace approach with a hybrid layer-selection strategy that combines offline verification of reliable layers with prompt-specific dynamic assessments, enabling stable bidirectional steering via forward-hook perturbations without weight updates. Key contributions include a per-trait low-rank subspace capturing most inter-trait energy, a four-part layer-selection pipeline (offline priors, dynamic selection, their hybrid, and intensity calibration), and polarity calibration to ensure correct directional effects, demonstrated across LLaMA-3-8B-Instruct and Ministral-8B-Instruct while preserving fluency and general capabilities. This work advances interpretable, controllable personalization for open-source models and highlights practical considerations for safety, reliability, and cross-architecture generalization in deployment scenarios.

Abstract

Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. Effective mechanisms for manipulating model behaviour during generation are a critical gap in the literature. Personality-aware LLMs offer a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored, as does the use of these representations to steer model behaviour. We propose a novel pipeline that extracts hidden-state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modelling human personality; applies low-rank subspace discovery methods; and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting fluency, variance or general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
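The perturbation-based steering summarised above can be illustrated with a minimal sketch. It assumes the update adds a unit trait direction, scaled by an intensity and signed by a polarity, to the hidden states at one chosen layer; this form of the update, the toy `tanh` layers, and the names `steer` and `run_layers` are all hypothetical and are not taken from the paper. In a real transformer the same addition would typically happen inside a forward hook on the selected layer; NumPy is used here only to keep the example self-contained.

```python
import numpy as np

def steer(hidden, trait_vector, intensity, polarity=1):
    """Add a signed, scaled unit trait direction to every token's hidden state."""
    unit = trait_vector / np.linalg.norm(trait_vector)
    return hidden + polarity * intensity * unit

def run_layers(x, layers, steer_at=None, **steer_kwargs):
    """Toy forward pass that perturbs the output of layer `steer_at` (if given)."""
    for i, weight in enumerate(layers):
        x = np.tanh(x @ weight)          # stand-in for a transformer block
        if i == steer_at:
            x = steer(x, **steer_kwargs)  # weights are never modified
    return x

rng = np.random.default_rng(0)
d = 16
layers = [rng.normal(scale=0.3, size=(d, d)) for _ in range(4)]
tokens = rng.normal(size=(5, d))         # 5 token positions
v = rng.normal(size=d)                   # hypothetical trait direction

# Steering at the last toy layer so the effect can be read off directly.
base = run_layers(tokens, layers)
high = run_layers(tokens, layers, steer_at=3, trait_vector=v,
                  intensity=2.0, polarity=+1)
low = run_layers(tokens, layers, steer_at=3, trait_vector=v,
                 intensity=2.0, polarity=-1)

unit_v = v / np.linalg.norm(v)
shift_high = (high - base) @ unit_v      # displacement along the trait axis
shift_low = (low - base) @ unit_v
```

Because the perturbation is applied at inference time only, removing it restores the base behaviour; the sign of `polarity` decides whether the trait is pushed up or down.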

Paper Structure

This paper contains 30 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Diagram illustrating our methodology in four phases. i. Extracting trait directions: Trait-wise mean difference vectors are computed from High/Low samples across layers. ii. PCA projection: The aggregated directions are projected onto a low-dimensional subspace to obtain orthogonal, normalised trait vectors. iii. Layer selection: The best injection layer is chosen using offline diagnostics (verified layer) combined with a prompt-specific dynamic layer for hybrid steering. iv. Steering at inference: The final intensity-scaled trait vectors are injected into the residual stream of the selected layers, guided by polarity, to produce personality-aligned text.
  • Figure 2: Diagrams representing the Hybrid Trait Selection for each of the personality traits used in our methods for the LLaMA-3-8B-Instruct model. Dotted lines represent the Offline Prior/Offline Best method, which serves as a static layer selection. The layer vs. frequency plot shows which layers were chosen, and how often, during the steering process. $\star$ represents the mean of dynamic layers selected across all runs.
  • Figure 3: Bar charts representing the High/Base/Low Traits and Fluency Scores with their corresponding variance scores.
  • Figure 4: Ablation: Results of steering in both positive and negative directions using Hybrid vs. Dynamic-only layer selection. $\Delta$ values represent the trait separation between High and Low values under each of the two selection methods. Separations are significantly higher with the Hybrid layer selection method.
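The first two phases named in the Figure 1 caption (mean-difference directions, then projection onto a low-dimensional subspace) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the activations are synthetic, the shapes and the rank of 3 are hypothetical, and the decomposition is a truncated SVD without mean-centring, used here as a PCA-style stand-in. The sketch normalises each projected trait vector but does not force the trait vectors to be mutually orthogonal; only the subspace basis is orthonormal.

```python
import numpy as np

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def mean_difference_directions(high, low):
    """One direction per trait: mean(High activations) - mean(Low activations)."""
    return np.stack([high[t].mean(axis=0) - low[t].mean(axis=0) for t in TRAITS])

def low_rank_trait_vectors(directions, rank):
    """Project trait directions onto their top-`rank` singular subspace."""
    _, s, vt = np.linalg.svd(directions, full_matrices=False)
    basis = vt[:rank]                                  # orthonormal rows
    energy = float((s[:rank] ** 2).sum() / (s ** 2).sum())
    projected = directions @ basis.T @ basis           # back in hidden space
    unit = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    return basis, unit, energy

# Synthetic stand-in for one layer's activations (hypothetical shapes).
rng = np.random.default_rng(0)
hidden_dim, n_samples = 64, 32
high = {t: rng.normal(loc=0.5, size=(n_samples, hidden_dim)) for t in TRAITS}
low = {t: rng.normal(loc=-0.5, size=(n_samples, hidden_dim)) for t in TRAITS}

directions = mean_difference_directions(high, low)     # shape (5, 64)
basis, trait_vectors, energy = low_rank_trait_vectors(directions, rank=3)
print(f"rank-3 subspace keeps {energy:.1%} of the squared norm")
```

The `energy` value is the fraction of the directions' total squared norm retained by the subspace, which is one simple way to check how low-rank the shared trait structure is at a given layer.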