Identifying and Manipulating the Personality Traits of Language Models

Graham Caron, Shashank Srivastava

TL;DR

Problem: Do language models exhibit human-like Big Five personality signals, and how susceptible are those signals to context-driven manipulation? Approach: Evaluate BERT-base and GPT-2 on a 50-item IPIP Big Five questionnaire; test three kinds of context (assessment-item context, Reddit descriptions, and psychometric survey data) for their ability to shift trait scores; map LM responses to human percentile baselines; and correlate the resulting scores with human trait data. Contributions: (i) evidence of strong, context-driven trait modulation, with correlations as high as $0.40$–$0.54$ (and $0.48$ between $X_{survey}$ and $X_{subject}$), together with two released datasets, one pairing human personality descriptions with Big Five assessment data and one of personality descriptions collected from Reddit; (ii) a demonstration that Big Five traits in transformer-based systems can be probed and manipulated in a predictable way; (iii) a discussion of ethical considerations for use in dialog systems and of the need for defenses against manipulation.
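The scoring pipeline summarized above (questionnaire responses, then trait scores, then human percentiles, then correlation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 1-5 Likert coding, the choice of reverse-keyed items, the baseline scores, and all function names are assumptions made for the example.

```python
from math import sqrt
from statistics import mean


def score_trait(responses, reverse_keyed):
    """Sum 1-5 Likert responses for one trait, flipping reverse-keyed items.

    `responses` holds the item responses for a single trait; `reverse_keyed`
    is a set of positions whose scale is inverted (hypothetical here).
    """
    total = 0
    for idx, value in enumerate(responses):
        total += (6 - value) if idx in reverse_keyed else value
    return total


def percentile_rank(score, baseline_scores):
    """Percentage of baseline (human) scores at or below `score`."""
    at_or_below = sum(1 for s in baseline_scores if s <= score)
    return 100.0 * at_or_below / len(baseline_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative (made-up) responses to ten items of one trait.
responses = [5, 1, 4, 2, 3, 3, 4, 2, 5, 1]
reverse_keyed = {1, 3, 5, 7, 9}  # assumption: odd positions are reverse-keyed
trait_score = score_trait(responses, reverse_keyed)

# Illustrative (made-up) human baseline scores for the same trait.
baseline = [20, 25, 30, 35, 40, 45]
trait_percentile = percentile_rank(trait_score, baseline)
```

A correlation such as the one reported between $X_{survey}$ and $X_{subject}$ would then correspond to calling `pearson` on two aligned lists of trait scores, one from the model with context and one from the human subjects.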

Abstract

Psychology research has long explored aspects of human personality such as extroversion, agreeableness and emotional stability. Categorizations like the 'Big Five' personality traits are commonly used to assess and diagnose personality types. In this work, we explore the question of whether the perceived personality in language models is exhibited consistently in their language generation. For example, is a language model such as GPT2 likely to respond in a consistent way if asked to go out to a party? We also investigate whether such personality traits can be controlled. We show that when provided different types of contexts (such as personality descriptions, or answers to diagnostic questions about personality traits), language models such as BERT and GPT2 can consistently identify and reflect personality markers in those contexts. This behavior illustrates an ability to be manipulated in a highly predictable way, and frames them as tools for identifying personality traits and controlling personas in applications such as dialog systems. We also contribute a crowd-sourced dataset of personality descriptions of human subjects paired with their 'Big Five' personality assessment data, and a dataset of personality descriptions collated from Reddit.

Paper Structure (17 sections, 7 figures, 16 tables)


Figures (7)

  • Figure 1: We explore measuring and manipulating personality traits in language models. The top frame shows an example of how a personality trait (here, openness to experience) might be expressed by a language model. Such traits can be assessed by analyzing the model's response to questions like the one shown. In the bottom frame, those responses are influenced by making additional context available to the language model. We show that such contexts can control 'Big Five' personality traits in a highly predictable way.
  • Figure 2: $\Delta_{cm}$ vs $r_{cm}$ plots for data from all traits. We observe a consistent change in personality scores ($\Delta_{cm}$) across context items as the strength of quantifiers changes.
  • Figure 3: Histograms of $\rho$ by trait for $\Delta_{cm}$ vs $r_{cm}$ context item plots. Across all ten scenarios, a plurality of context items show a strong correlation (peak close to 1) between observed changes in personality traits and strengths of quantifiers in the context items.
  • Figure 4: BERT & GPT2 $X_{survey}$ vs $X_{subject}$ plots (Directed Responses with outliers removed). Regression lines and correlation coefficients ($\rho$) are shown.
  • Figure 5: The plot compares $\rho$ from model evaluation with item context and with survey context. Survey context $\rho$ values shown here are from Undirected Responses ($c \geq 100$). In both cases, $\rho$ measures the correlation between trait scores with context and expected behavior. The variables used to quantify expected behavior differ between experiments.
  • ...and 2 more figures