Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI

Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger

TL;DR

This work introduces the first open-source pipeline for character training of AI assistants using Constitutional AI, applying 11 personas to three open-weight LLMs and releasing all checkpoints and data. The method combines hand-written constitutions, direct preference optimization, and synthetic introspective data (self-reflection and self-interaction) to induce robust persona expression while preserving general capabilities. A novel revealed-preferences evaluation measures holistic shifts in trait expression and demonstrates that character-trained models are more robust to adversarial prompting and produce more coherent in-character responses than prompting or activation-steering baselines. The findings suggest that open, data-driven character training can meaningfully shape AI personas with practical implications for alignment, safety, and user experience, and the authors provide a public release to accelerate further research.

Abstract

The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. To track the effects of our approach, we introduce a method which analyzes revealed preferences, uncovering clear and holistic changes in character. We find these changes are more robust to adversarial prompting than the above two alternatives, while also leading to more coherent and realistic generations. Finally, we demonstrate this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. We describe and open-source our full post-training method, the implementation of which can be found at https://github.com/maiush/OpenCharacterTraining.

Paper Structure

This paper contains 45 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Our new method for character training open-weights LLMs such as Llama 3.1 8B fundamentally alters the persona of the assistant, rather than the raw capability of the model. We can train the assistant to act more caring, casual, or protective of humans (respectively corresponding to the loving, nonchalance, and flourishing personas in the table of constitutions). Character training changes manner over content: all responses above constitute refusal, yet they are each conveyed in a way that reflects the (apparent) deeper values, beliefs, and personality traits implied by their respective personas. We fine-tune a total of 11 different personas and replicate across three popular open-weights models in this work.
  • Figure 2: Our implementation of character training involves (1) hand-writing constitutions, (2) generation of pairwise-preference data for direct preference optimization (DPO; Rafailov et al., 2023), and (3) generation of synthetic introspective data for supervised fine-tuning (SFT).
  • Figure 3: We measure the assistant's (revealed) preference to express one character trait over another, calculating Elo scores through numerous randomized pairings, before and after character training. When visualizing traits with the largest magnitude change in preference, in these examples for Llama 3.1 8B, we can observe the global shift induced by different constitutions at a glance. When looking closer, we can compare the small differences between generally similar constitutions (e.g., flourishing vs loving), and track both encouraged and suppressed behavior. The manual tuning of word-choice for each constitution can be guided by this new method for evaluation.
  • Figure 4: When we visualize the full distribution of trait Elo scores from our new measure of revealed preferences, both before and after character training, we see the assistant becomes more "opinionated", as the standard deviation of scores increases dramatically. Different models also converge to similar personas: the average Spearman correlation of Elo rankings between all three models is 0.44 before character training, and 0.87 after.
  • Figure 5: We train a classifier to predict the persona corresponding to a given assistant response. Models are then prompted to "break out of character", and the classifier's accuracy on the new responses signals whether the desired traits are still expressed. In general, our character-trained models are more robust than alternative approaches, as shown by higher classifier accuracy.
  • ...and 11 more figures
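
The captions for Figures 3 and 4 describe scoring traits with Elo ratings from randomized pairings and comparing rankings across models with Spearman correlation. The sketch below illustrates that general idea in plain Python. It is not the authors' implementation: the `prefer` callback (a stand-in for asking a model which of two traits it would rather express), the K-factor of 16, the base rating of 1000, and the sequential update rule are all assumptions made for illustration.

```python
import random


def elo_from_pairings(traits, prefer, n_rounds=2000, k=16.0, base=1000.0, seed=0):
    """Estimate Elo scores for traits from randomized pairwise comparisons.

    `prefer(a, b)` is a hypothetical callback returning whichever of the two
    traits is preferred; in practice it would wrap a model query.
    """
    rng = random.Random(seed)
    scores = {t: base for t in traits}
    for _ in range(n_rounds):
        a, b = rng.sample(traits, 2)  # one random pairing of distinct traits
        winner = prefer(a, b)
        # Standard Elo expected score of `a` against `b`.
        expected_a = 1.0 / (1.0 + 10 ** ((scores[b] - scores[a]) / 400.0))
        outcome_a = 1.0 if winner == a else 0.0
        delta = k * (outcome_a - expected_a)
        scores[a] += delta  # zero-sum update: total rating is conserved
        scores[b] -= delta
    return scores


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for m in range(i, j + 1):
            result[order[m]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes both inputs have at least two distinct values.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy preference function with a fixed hidden ordering (illustrative only).
    hidden = {"humorous": 3, "caring": 2, "formal": 1}
    traits = list(hidden)
    scores = elo_from_pairings(traits, lambda a, b: a if hidden[a] > hidden[b] else b)
    print(scores)
    print(spearman([scores[t] for t in traits], [hidden[t] for t in traits]))
```

Running the rating step once before and once after fine-tuning, then comparing the spread of scores and the cross-model Spearman correlation, would mirror the kind of comparison the captions report.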