Designing Role Vectors to Improve LLM Inference Behaviour
Daniele Potertì, Andrea Seveso, Fabio Mercorio
TL;DR
This paper introduces role vectors, directions in a model's activation space extracted from its internal activations, as a mechanism for steering LLMs toward domain-specific expertise and as an alternative to traditional persona-based prompting. The authors construct 29 role directions as differences of mean activations and apply two interventions, activation addition and directional ablation. They report measurable improvements on domain-relevant benchmarks with limited impact on unrelated tasks. The approach is a three-part pipeline: (1) generating prompts from PersonaHub and Alpaca to form the role-specific set $D_r$ and the baseline set $D_{\text{base}}$; (2) computing, for each role $r$, layer $l$, and token position $i$, the direction $d_{i,r}^{(l)} = \mu_{i,r}^{(l)} - \nu_i^{(l)}$, where $\mu_{i,r}^{(l)}$ and $\nu_i^{(l)}$ are the mean activations over $D_r$ and $D_{\text{base}}$, respectively; and (3) evaluating the interventions across layers $l$, positions $i$, and coefficients $\alpha$ on the MMLU-style test set $\mathcal{D}_{\text{test}}$. Results show that larger models exhibit stronger, more interpretable directional signals. According to patch-scoping, only a subset of the performance-improving directions align with the intended role. Ablation studies show that removing these directions can degrade domain-specific performance while leaving unrelated tasks unaffected or slightly improved. These findings indicate that manipulating internal representations can steer performance more effectively than persona-based prompting, with implications for controllable LLM behaviour and for future mechanistic analyses using activation patching.
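The difference-of-means step $d_{i,r}^{(l)} = \mu_{i,r}^{(l)} - \nu_i^{(l)}$ can be sketched in NumPy. This is a minimal illustration, not the authors' code. It assumes activations have already been collected into arrays of shape (prompts, layers, positions, d_model). The function name `difference_of_means` and the synthetic data are hypothetical.

```python
import numpy as np

def difference_of_means(role_acts: np.ndarray, base_acts: np.ndarray) -> np.ndarray:
    """Return d[l, i] = mu_{i,r}^{(l)} - nu_i^{(l)} for every layer l and position i.

    Hypothetical array layout (an assumption, not taken from the paper):
      role_acts: (n_role_prompts, n_layers, n_positions, d_model), activations on D_r
      base_acts: (n_base_prompts, n_layers, n_positions, d_model), activations on D_base
    """
    if role_acts.shape[1:] != base_acts.shape[1:]:
        raise ValueError("role and base activations must share (layers, positions, d_model)")
    mu = role_acts.mean(axis=0)  # mean over the prompts in D_r
    nu = base_acts.mean(axis=0)  # mean over the prompts in D_base
    return mu - nu  # shape (n_layers, n_positions, d_model)


# Synthetic check: inject a known "role" offset along hidden dimension 3.
rng = np.random.default_rng(0)
n_layers, n_positions, d_model = 4, 2, 8
offset = np.zeros(d_model)
offset[3] = 2.0
base = rng.normal(size=(64, n_layers, n_positions, d_model))
role = rng.normal(size=(32, n_layers, n_positions, d_model)) + offset

directions = difference_of_means(role, base)
print(directions.shape)  # (4, 2, 8)
print(np.argmax(np.abs(directions).mean(axis=(0, 1))))  # 3, the injected dimension
```

The result holds one candidate direction per (layer, position) pair. This matches the TL;DR's description of sweeping over $l$ and $i$ (together with $\alpha$) when choosing which direction to intervene with.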
Abstract
The influence of personas on Large Language Models (LLMs) has been widely studied, yet their direct impact on performance remains uncertain. This work explores a novel approach to guiding LLM behaviour through role vectors, an alternative to persona-based prompting. We construct 29 role vectors derived from model activations and evaluate their impact on benchmark performance across multiple domains. Our analysis investigates whether these vectors can effectively steer models toward domain-specific expertise. We evaluate two interventions: (i) activation addition, which reinforces role-specific directions, and (ii) directional ablation, which removes them. Results on well-established benchmarks indicate that role vectors do influence model behaviour, improving task performance in relevant domains while only marginally affecting unrelated tasks. This suggests that manipulating internal model representations has a greater impact on outcomes than persona-based prompting.
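The two interventions can likewise be sketched on plain arrays. The abstract does not spell out the exact formulas, so this sketch assumes the common formulation from the activation-steering literature. Activation addition is $x' = x + \alpha d$. Directional ablation is $x' = x - \hat{d}\hat{d}^{\top}x$, with $\hat{d} = d / \lVert d \rVert$. In a real model these would be applied to activations during the forward pass, for example via hooks. Here they act on toy vectors, and the function names are hypothetical.

```python
import numpy as np

def activation_addition(x: np.ndarray, d: np.ndarray, alpha: float) -> np.ndarray:
    """Shift each activation vector in x (shape (..., d_model)) by alpha * d."""
    return x + alpha * d

def directional_ablation(x: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component of each activation vector in x along direction d."""
    d_hat = d / np.linalg.norm(d)          # unit vector along the role direction
    coeffs = x @ d_hat                     # projection length per vector, shape (...,)
    return x - coeffs[..., None] * d_hat   # subtract the projection onto d_hat


# Toy activations for two tokens (d_model = 3) and a role direction along axis 0.
x = np.array([[3.0, 1.0, -2.0],
              [0.5, 4.0, 0.0]])
d = np.array([2.0, 0.0, 0.0])

steered = activation_addition(x, d, alpha=1.5)
ablated = directional_ablation(x, d)
print(steered)      # rows: [6, 1, -2] and [3.5, 4, 0]
print(ablated)      # rows: [0, 1, -2] and [0, 4, 0]
print(ablated @ d)  # zero projection onto d for both tokens
```

One design consequence of this assumed formulation is that ablation depends only on the direction of $d$, not its norm. The strength of activation addition, by contrast, depends on both $\alpha$ and $\lVert d \rVert$. This is one reason a coefficient sweep is needed for addition but not for ablation.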
