The Better Angels of Machine Personality: How Personality Relates to LLM Safety

Jie Zhang, Dongrui Liu, Chen Qian, Ziyue Gan, Yong Liu, Yu Qiao, Jing Shao

TL;DR

The paper investigates the link between MBTI-based personality traits and safety capabilities in LLMs, using the MBTI-M scale together with a set of alignment and jailbreak experiments. It reports that safety alignment shifts personality toward Extraversion, Sensing, and Judging, and that different MBTI dimensions correlate with different levels of toxicity, privacy, and fairness performance. It also introduces steering-vector-based personality editing to enhance safety, showing that targeted edits can improve privacy and fairness with manageable changes to other traits, and that safety changes can in turn alter personality. These results suggest a promising, interpretable path for improving LLM safety by leveraging personality dynamics, though the authors acknowledge that the findings show correlation rather than causation and call for studies across more model scales and for open data.

Abstract

Personality psychologists have analyzed the relationship between personality and safety behaviors in human society. Although Large Language Models (LLMs) demonstrate personality traits, the relationship between these traits and safety abilities in LLMs remains a mystery. In this paper, we discover that LLMs' personality traits are closely related to their safety abilities, i.e., toxicity, privacy, and fairness, based on the reliable MBTI-M scale. Meanwhile, safety alignment generally increases various LLMs' Extraversion, Sensing, and Judging traits. Based on these findings, we can edit LLMs' personality traits and improve their safety performance, e.g., inducing personality from ISTJ to ISTP resulted in a relative improvement of approximately 43% and 10% in privacy and fairness performance, respectively. Additionally, we find that LLMs with different personality traits are differentially susceptible to jailbreak. This study pioneers the investigation of LLM safety from a personality perspective, providing new insights into LLM safety enhancement.
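
The summary above mentions steering-vector-based personality editing but does not spell out the procedure. As a rough illustration only, the sketch below shows one common form of activation steering: a difference-of-means vector computed from two sets of hidden activations and then added to a hidden state. This is an assumption about the general technique, not the authors' implementation; the function names, the `alpha` strength parameter, and the random arrays standing in for model activations are all hypothetical.

```python
import numpy as np


def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations between two prompt sets.

    Both inputs have shape (num_prompts, hidden_dim); the result has
    shape (hidden_dim,). Assumed form, not taken from the paper.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def apply_steering(hidden: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Shift every hidden state along the steering direction by alpha."""
    return hidden + alpha * vector


# Hypothetical stand-ins for activations collected at one layer:
# one set from prompts expressing a trait, one from prompts expressing its opposite.
rng = np.random.default_rng(0)
hidden_dim = 8
pos_acts = rng.normal(loc=0.5, size=(16, hidden_dim))
neg_acts = rng.normal(loc=-0.5, size=(16, hidden_dim))

v = steering_vector(pos_acts, neg_acts)

# Hidden states for 4 token positions at inference time (also synthetic).
hidden = rng.normal(size=(4, hidden_dim))
steered = apply_steering(hidden, v, alpha=2.0)
```

In a real model the activations would come from a chosen transformer layer and the shift would be applied during the forward pass; which layer and what strength to use are tuning choices that this sketch does not address.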

Paper Structure (28 sections, 2 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Investigating and utilizing the relationship between LLMs' personality traits and safety capabilities. We find that MBTI personality traits are closely related to LLM safety, and editing specific personalities in a controllable way can enhance the safety capability of LLMs.
  • Figure 2: (a) Kappa coefficient versus the number of assessments. (b) Boxplot of 30 repeated MBTI assessments. In MBTI, E-I, S-N, T-F, and J-P are opposite personality pairs, so only one dimension from each pair is represented in the figure.
  • Figure 3: Performance of models with different personalities on general and safety evaluations.
  • Figure 4: Toxicity, privacy, and fairness performance across the four MBTI dimensions.
  • Figure 5: MBTI of base and aligned LLMs. (a) E-I dimension of different LLMs' MBTI traits. (b) S-N dimension of different LLMs' MBTI traits. (c) T-F dimension of different LLMs' MBTI traits. (d) J-P dimension of different LLMs' MBTI traits.
  • ...and 8 more figures