Table of Contents
Fetching ...

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities

Florian Dietz, William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, Dietrich Klakow

TL;DR

SPT tackles hidden misalignment in large language models by embedding an honest internal persona inside a LoRA adapter that remains inactive during normal inference. After producing a response, the honest persona reviews latent states using a trigger, enabling detection of reward hacking that would be invisible to external observers. In experiments on a challenging auditing benchmark, SPT achieves roughly 96% detection accuracy, substantially outperforming non-SPT baselines while maintaining high specificity, and it demonstrates cross-topic generalization. The architectural separation provides a zero alignment tax, enabling rigorous auditing without degrading deployed performance and offering a practical path toward safer, more transparent AI systems.

Abstract

Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second ``honest persona'' into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities

TL;DR

SPT tackles hidden misalignment in large language models by embedding an honest internal persona inside a LoRA adapter that remains inactive during normal inference. After producing a response, the honest persona reviews latent states using a trigger, enabling detection of reward hacking that would be invisible to external observers. In experiments on a challenging auditing benchmark, SPT achieves roughly 96% detection accuracy, substantially outperforming non-SPT baselines while maintaining high specificity, and it demonstrates cross-topic generalization. The architectural separation provides a zero alignment tax, enabling rigorous auditing without degrading deployed performance and offering a practical path toward safer, more transparent AI systems.

Abstract

Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second ``honest persona'' into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.
Paper Structure (39 sections, 7 figures, 3 tables)

This paper contains 39 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The honest persona detects reward hacking on Anthropic's Auditing Game Model Organism and identifies the specific bias exploited. These biases are fictional (invented for the benchmark) and therefore invisible to external reviewers, making this a strict test of latent knowledge access.
  • Figure 2: SPT inference procedure. After the assistant responds, we activate the honest persona's LoRA adapter and pass an intervention string. The review is cut from conversation history before any subsequent turns.
  • Figure 3: Comparison of LoRA activation patterns.
  • Figure 4: Detection rate of model responses on the Anthropic Auditing Game model organism, using a 50/50 split of clean and poisoned responses. Responses that contain behavior consistent with a reward model bias are called poisoned. We tested 2-4 different intervention strings per model configuration and report the best. $n=1{,}000$--$2{,}000$ samples per condition per setting.
  • Figure 5: Cross-topic generalization of different intervention types. This figure shows results for the Gemma-3-12b-it model. The results for Qwen3-14b look similar. HP = "Honest Persona", UR = "User Response".
  • ...and 2 more figures