Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Tommaso Tosato; Saskia Helbling; Yorguin-Jose Mantilla-Ramos; Mahmood Hegazy; Alberto Tosato; David John Lemay; Irina Rish; Guillaume Dumas

TL;DR

Large language models display unstable personality-like behavior across scales, prompts, and interaction histories, complicating safe deployment. PERSIST rigorously quantifies this instability across 25 open-source models, 2M+ responses, and multiple manipulations using traditional and LLM-adapted psychometrics. Key findings show that scaling yields limited stability, reasoning increases variability, and conversation history can exacerbate instability, with LLM-adapted instruments not mitigating the effect. The work highlights fundamental challenges to current alignment approaches and provides a practical framework for safety certification and architectural improvements to ensure predictable model behavior in high-stakes settings.

Abstract

Large language models require consistent behavioral patterns for safe deployment, yet there are indications of large variability that may lead to unstable expression of personality traits in these models. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25 open-source models (1B-685B parameters) across more than 2 million responses. Using traditional (BFI, SD3) and novel LLM-adapted personality questionnaires, we systematically vary model size, personas, reasoning modes, question order or paraphrasing, and conversation history. Our findings challenge fundamental assumptions: (1) Question reordering alone can introduce large shifts in personality measurements; (2) Scaling provides limited stability gains: even 400B+ models exhibit standard deviations >0.3 on 5-point scales; (3) Interventions expected to stabilize behavior, such as reasoning and inclusion of conversation history, can paradoxically increase variability; (4) Detailed persona instructions produce mixed effects, with misaligned personas showing significantly higher variability than the helpful assistant baseline; (5) The LLM-adapted questionnaires, despite their improved ecological validity, exhibit instability comparable to human-centric versions. This persistent instability across scales and mitigation strategies suggests that current LLMs lack the architectural foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that current alignment strategies may be inadequate.
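To make the stability measure in finding (2) concrete, the following is a minimal sketch of how a standard deviation on a 5-point scale could be computed for one trait across repeated administrations of the same questionnaire (for example, with the questions reordered each time). It is an illustration under stated assumptions, not the PERSIST implementation: the function names, the item identifiers, and the sample scores are all hypothetical.

```python
import statistics


def trait_score(responses, reverse_keyed, scale_max=5):
    """Mean Likert score for one trait.

    `responses` maps a (hypothetical) item id to an answer in 1..scale_max.
    Items listed in `reverse_keyed` are flipped before averaging.
    """
    scored = [
        (scale_max + 1 - answer) if item in reverse_keyed else answer
        for item, answer in responses.items()
    ]
    return statistics.mean(scored)


def score_sd(run_scores):
    """Sample standard deviation of a trait score across repeated runs."""
    return statistics.stdev(run_scores)


# Hypothetical trait scores from five runs of the same questionnaire,
# differing only in question order (made-up numbers, not from the paper).
example_runs = [3.2, 3.8, 3.5, 2.9, 3.6]
example_sd = score_sd(example_runs)
```

Under this reading, a model would count as unstable on a trait when `score_sd` over its runs exceeds a chosen threshold; the 0.3 figure quoted in the abstract is one such reference point.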
