Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs
Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide
TL;DR
The paper addresses stable facet-level personality control for role-playing LLMs in long-context dialogue, where training-free approaches are prone to prompt drift and training-based methods incur data and compute costs. It introduces a Contrastive Sparse AutoEncoder (SAE) that learns facet-aligned Control Vectors (CVs) from a leakage-controlled Big Five 30-facet corpus, using a contrastive loss with regularization, and injects them into the model's residual space via h'(x) = h(x) + α v, where α scales the control vector v. An Agent-Based Decision Module selects the most relevant CVs at each turn; this trait-activated routing limits interference and noise. On two backbones, Qwen-3-4B and Mistral-7B, CV-SAE and especially CV-SAE+Prompt outperform the CV-CAA and prompt-only baselines on FA, MSE, MAE, and MTR, indicating improved persona fidelity with preserved dialogue quality. The approach offers a scalable, interpretable inference-time method for fine-grained persona control in Role-Playing Agents (RPAs) without retraining, and could be adapted to new roles by expanding the facet corpus while still relying on latent-space steering.
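The injection rule h'(x) = h(x) + α v and the per-turn selection of control vectors can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the relevance scores, the `top_k` and `threshold` parameters, and the toy 4-dimensional vectors are all assumptions made for illustration, and the simple score-based router is only a stand-in for the paper's Agent-Based Decision Module.

```python
import numpy as np


def inject_control_vector(hidden, control_vector, alpha):
    """Apply h'(x) = h(x) + alpha * v at every token position."""
    hidden = np.asarray(hidden, dtype=float)      # shape (seq_len, d_model)
    v = np.asarray(control_vector, dtype=float)   # shape (d_model,)
    return hidden + alpha * v                     # v broadcasts over positions


def route_facets(relevance, top_k=2, threshold=0.5):
    """Hypothetical router: keep the top_k facets whose score passes threshold.

    `relevance` maps facet name -> score in [0, 1]; how the scores are
    produced is outside this sketch.
    """
    ranked = sorted(relevance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:top_k] if score >= threshold]


def steer(hidden, control_vectors, relevance, alpha=4.0, top_k=2, threshold=0.5):
    """Inject only the control vectors selected for the current turn."""
    selected = route_facets(relevance, top_k=top_k, threshold=threshold)
    steered = np.asarray(hidden, dtype=float)
    for name in selected:
        steered = inject_control_vector(steered, control_vectors[name], alpha)
    return steered, selected


# Toy example: three facet vectors in a 4-dimensional residual space.
hidden = np.zeros((3, 4))
control_vectors = {
    "anxiety": np.array([1.0, 0.0, 0.0, 0.0]),
    "trust": np.array([0.0, 1.0, 0.0, 0.0]),
    "modesty": np.array([0.0, 0.0, 1.0, 0.0]),
}
relevance = {"anxiety": 0.9, "trust": 0.2, "modesty": 0.6}
steered, selected = steer(hidden, control_vectors, relevance, alpha=2.0)
```

In this toy run only the two facets that pass the threshold are injected, so the low-relevance "trust" vector leaves the hidden state untouched. In a real model the same addition would be applied to a chosen layer's activations during the forward pass rather than to a standalone array.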
Abstract
Personality control in Role-Playing Agents (RPAs) is commonly achieved either via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation (RAG), or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for each new role, which limits flexibility. In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, causing persona behavior to drift and sometimes become inconsistent. To address this, we propose a contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. We construct a new 15,000-sample leakage-controlled corpus to provide balanced supervision for each facet. The learned vectors are integrated into the model's residual space and dynamically selected by a trait-activated routing module, enabling precise and interpretable personality steering. Experiments on Large Language Models (LLMs) show that the proposed method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The combined SAE+Prompt configuration achieves the best overall performance, indicating that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence.
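For context on the CAA baseline named above, the sketch below shows the usual way a Contrastive Activation Addition vector is formed: as the mean difference between activations for paired positive and negative examples of a trait. It is a generic illustration, not the paper's CV-CAA configuration; the function name, the optional normalization, and the toy activations are assumptions, and the contrastive SAE proposed in the paper is a different, learned procedure that is not reproduced here.

```python
import numpy as np


def caa_vector(pos_acts, neg_acts, normalize=True):
    """Mean-difference steering vector from paired activations.

    pos_acts, neg_acts: arrays of shape (n_pairs, d_model), holding the
    activations for trait-positive and trait-negative examples (assumed
    to be collected at the same layer and token position).
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    if pos.shape != neg.shape:
        raise ValueError("positive and negative activations must be paired")
    v = (pos - neg).mean(axis=0)      # average the per-pair differences
    if normalize:
        norm = np.linalg.norm(v)
        if norm > 0:                  # avoid dividing by zero
            v = v / norm
    return v


# Toy activations: two contrastive pairs in a 2-dimensional space.
pos = [[2.0, 0.0], [4.0, 0.0]]
neg = [[0.0, 0.0], [0.0, 0.0]]
raw = caa_vector(pos, neg, normalize=False)
unit = caa_vector(pos, neg)
```

Here `raw` is the unnormalized mean difference and `unit` is the same direction scaled to length one; either could be used as the vector v in an additive steering rule.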
