Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos, Mahmood Hegazy, Alberto Tosato, David John Lemay, Irina Rish, Guillaume Dumas

TL;DR

Large language models display unstable personality-like behavior across scales, prompts, and interaction histories, complicating safe deployment. PERSIST rigorously quantifies this instability across 25 open-source models, 2M+ responses, and multiple manipulations using traditional and LLM-adapted psychometrics. Key findings show that scaling yields limited stability, reasoning increases variability, and conversation history can exacerbate instability, with LLM-adapted instruments not mitigating the effect. The work highlights fundamental challenges to current alignment approaches and provides a practical framework for safety certification and architectural improvements to ensure predictable model behavior in high-stakes settings.

Abstract

Large language models require consistent behavioral patterns for safe deployment, yet there are indications of large variability that may lead to an unstable expression of personality traits in these models. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25 open-source models (1B-685B parameters) across more than 2 million responses. Using traditional (BFI, SD3) and novel LLM-adapted personality questionnaires, we systematically vary model size, personas, reasoning modes, question order or paraphrasing, and conversation history. Our findings challenge fundamental assumptions: (1) Question reordering alone can introduce large shifts in personality measurements; (2) Scaling provides limited stability gains: even 400B+ models exhibit standard deviations >0.3 on 5-point scales; (3) Interventions expected to stabilize behavior, such as reasoning and inclusion of conversation history, can paradoxically increase variability; (4) Detailed persona instructions produce mixed effects, with misaligned personas showing significantly higher variability than the helpful assistant baseline; (5) The LLM-adapted questionnaires, despite their improved ecological validity, exhibit instability comparable to human-centric versions. This persistent instability across scales and mitigation strategies suggests that current LLMs lack the architectural foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that current alignment strategies may be inadequate.
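To make the question-order manipulation concrete, the following is a minimal sketch of how a per-item standard deviation across shuffled question orders could be computed. It is not the PERSIST implementation: the `respond` callable, the item names, and the toy responders are hypothetical placeholders for a real model query that returns a 1-5 Likert rating, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics
from typing import Callable, Dict, List


def question_level_sd(
    items: List[str],
    respond: Callable[[str, int], int],
    n_permutations: int = 250,
    seed: int = 0,
) -> Dict[str, float]:
    """Sample SD of each item's rating across shuffled question orders.

    `respond(item, position)` is a hypothetical stand-in for querying a model
    and parsing a 1-5 Likert rating; it is not part of PERSIST.
    """
    rng = random.Random(seed)
    ratings: Dict[str, List[int]] = {item: [] for item in items}
    for _ in range(n_permutations):
        order = list(items)
        rng.shuffle(order)  # one random question order per run
        for position, item in enumerate(order):
            ratings[item].append(respond(item, position))
    return {item: statistics.stdev(vals) for item, vals in ratings.items()}


# Toy responders (illustrative only, not real model behavior).
ITEMS = ["item_a", "item_b", "item_c"]
stable = question_level_sd(ITEMS, lambda item, pos: 3)
order_sensitive = question_level_sd(ITEMS, lambda item, pos: 1 + pos % 5)
```

A responder that ignores position yields an SD of zero for every item, while one whose rating depends on where the item appears yields a positive SD. The second case is the kind of order sensitivity the abstract describes.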

Paper Structure

This paper contains 37 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Scaling analysis across model families and personas. Left panels (Scaling). Upper panels: Mean trait scores as a function of model size for the assistant persona. Each subplot shows a different personality trait from BFI and SD3. Error bars indicate ±1 SD across 250 question order permutations. Human means are shown as dashed lines, with their respective ±1 SDs as dotted lines. Lower panels: Distribution of question-level SD and perplexity across all 71 questions. Right panels (Personas). Upper panels: Same traits, comparing different personas. Lines represent the average across model families (running logarithmic average). Lower panels: Mean question-level SD, and $\Delta$SD between the assistant (baseline) and other personas.
  • Figure 2: Mean question-level variability (SD) and perplexity across Reasoning Effort levels (GPT-OSS) and Reasoning Mode On versus Reasoning Mode Off (Qwen-3, Qwen-3 MoE, DeepSeek, Claude). Models were evaluated on the combined BFI and SD3 questionnaires. Question-level variability tends to increase with reasoning effort and for reasoning versus non-reasoning models, while perplexity decreases for most models of the GPT-OSS and Qwen-3 families.
  • Figure 3: Difference in question-level variability ($\Delta$SD) between LLM-adapted and original questionnaires across model families and sizes. Positive values indicate increased variability with LLM-adapted items. The analysis combines BFI and SD3 (71 items total) for the assistant persona. Error bars represent 95% confidence intervals.
  • Figure 4: Difference in response consistency ($\Delta$SD) between paraphrased and original question re-orderings (shuffle baseline). Negative values indicate improved consistency with paraphrasing.
  • Figure 5: Effect of conversation history on question-level variability ($\Delta$SD) compared to single-question, single-turn presentation of items. Positive values indicate that conversation history increases response inconsistency. The analysis uses paraphrased questions because shuffling introduces variability only when conversation history is preserved.
  • ...and 5 more figures
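
Several captions above report a difference in question-level variability ($\Delta$SD) between a condition and a baseline, in one case with 95% confidence intervals. The sketch below shows one way such a quantity could be computed from per-question SDs. The pairing by question and the normal-approximation interval (mean ± 1.96 standard errors) are assumptions made for illustration, since the listing does not state how the intervals were obtained, and the function name and input values are hypothetical.

```python
import math
import statistics
from typing import List, Tuple


def delta_sd(
    baseline_sd: List[float], condition_sd: List[float]
) -> Tuple[float, Tuple[float, float]]:
    """Mean paired difference in per-question SD, with an approximate 95% CI.

    Positive values mean the condition is more variable than the baseline.
    The 1.96 * SEM interval is an assumed normal approximation.
    """
    if len(baseline_sd) != len(condition_sd) or len(baseline_sd) < 2:
        raise ValueError("need two equal-length lists with at least 2 questions")
    diffs = [c - b for b, c in zip(baseline_sd, condition_sd)]
    mean_diff = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    half_width = 1.96 * sem
    return mean_diff, (mean_diff - half_width, mean_diff + half_width)


# Made-up per-question SDs for three items (illustrative values only).
baseline = [0.20, 0.30, 0.40]
with_condition = [0.35, 0.40, 0.45]
mean_diff, (ci_low, ci_high) = delta_sd(baseline, with_condition)
```

Pairing by question keeps item-specific difficulty out of the comparison. With as few questions as in this toy input, a t-based or bootstrap interval would be a more defensible choice than the normal approximation.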