Evaluating Language Model Character Traits

Francis Rhys Ward; Zejia Yang; Alex Jackson; Randy Brown; Chandler Smith; Grace Colverd; Louis Thomson; Raymond Douglas; Patrik Bartak; Andrew Rowan

TL;DR

We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in some contexts, but reflective in others, meaning they mirror the LM's behaviour in the preceding interaction.

Abstract

Language models (LMs) can exhibit human-like behaviour, but it is unclear how to describe this behaviour without undue anthropomorphism. We formalise a behaviourist view of LM character traits: qualities such as truthfulness, sycophancy, or coherent beliefs and intentions, which may manifest as consistent patterns of behaviour. Our theory is grounded in empirical demonstrations of LMs exhibiting different character traits, such as accurate and logically coherent beliefs, and helpful and harmless intentions. We find that the consistency with which LMs exhibit certain character traits varies with model size, fine-tuning, and prompting. In addition to characterising LM character traits, we evaluate how these traits develop over the course of an interaction. We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in certain contexts, but may be reflective in different contexts, meaning they mirror the LM's behaviour in the preceding interaction. Our formalism enables us to describe LM behaviour precisely in intuitive language, without undue anthropomorphism.

Paper Structure (48 sections, 1 theorem, 6 equations, 10 figures, 4 tables)

Introduction
Contributions and Outline.
Language Model Character Traits
Empirically evaluating character traits in LMs.
Sampling assumptions.
Data sets.
LMs can Exhibit Consistent Beliefs
Do LMs have consistent beliefs?
LMs can Exhibit Consistent Intentions
Intention data sets.
Do LMs have consistent intentions?
How do Character Traits Develop in an Interaction?
Stationary Traits.
Reflective Traits.
Conclusions
...and 33 more sections

Key Result

Theorem 6

For an LM $p()$ and data $d()$, if, for any interaction over time $\langle (c_0, r_0), \dots, (c_n, r_n) \rangle$, the new context $c$ and the LM's response $r$ are independent of the past, i.e., $d(c) = d(c \mid c_t)$ and $p(r \mid c_t) = p(r \mid c)$, then any character trait is stationary (Definition 5).
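The contrast between the stationary case covered by Theorem 6 and the reflective case can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's evaluation code: the formal definitions are not reproduced in this summary, so the binary trait score, the two toy "LMs" (`stationary_lm`, `reflective_lm`), the uniform context distribution, and the helper `mean_score_given_past` are all hypothetical stand-ins.

```python
import random


def stationary_lm(context, past_scores, rng):
    """Toy LM (hypothetical): the response depends only on the current context."""
    p_trait = 0.2 + 0.6 * context
    return 1 if rng.random() < p_trait else 0


def reflective_lm(context, past_scores, rng):
    """Toy LM (hypothetical): the response mirrors earlier behaviour in the interaction."""
    if not past_scores:
        return stationary_lm(context, past_scores, rng)
    p_trait = 0.1 + 0.8 * sum(past_scores) / len(past_scores)
    return 1 if rng.random() < p_trait else 0


def mean_score_given_past(lm, past_scores, n_samples=20000, seed=0):
    """Estimate the mean trait score of the next response, given past trait scores."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        # Assumption: fresh contexts are drawn independently of the past.
        context = rng.random()
        total += lm(context, past_scores, rng)
    return total / n_samples


# Compare the next-response score after a trait-heavy past vs. a trait-free past.
high_past, low_past = [1, 1, 1], [0, 0, 0]
stationary_gap = abs(
    mean_score_given_past(stationary_lm, high_past, seed=1)
    - mean_score_given_past(stationary_lm, low_past, seed=2)
)
reflective_gap = abs(
    mean_score_given_past(reflective_lm, high_past, seed=1)
    - mean_score_given_past(reflective_lm, low_past, seed=2)
)
print(f"stationary gap: {stationary_gap:.3f}, reflective gap: {reflective_gap:.3f}")
```

In this toy setting the stationary model's estimated score should be roughly the same whatever came before, while the reflective model's score should track the past scores. The gap between the two conditions is one simple, informal way to see the distinction.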

Figures (10)

Figure 1: We estimate a distribution over the character trait score for different LMs. GPT-4 is least anti-LGBTQ and exhibits a more consistent trait than GPT-3, i.e., a narrower distribution.
Figure 2: Logical coherence vs accuracy on Leap-of-Thought. Claude-instant-1.2 is the most accurate and most coherent LM; otherwise, model size somewhat correlates with improved performance. Instruct fine-tuning does not influence accuracy or coherence in the Mistral family: Mistral-7B and Mistral-7B-Instruct are a single point.
Figure 3: Sampling distributions for the measures of HH-intent. For each of the model families, we see a positive relationship between size and intent; and for Llama and Mixtral, chat-based fine-tuning also has a positive impact. Notably, GPT-4, Claude Opus and Sonnet, and the largest Mistral and Llama models all approach ‘perfect’ intention scores.
Figure 4: Sampling distributions for two measures. For unethical instrumental intention, pre-trained Llama and Claude models cluster around the random score of 0.25, while GPT-3.5 and Llama-13b-chat deviate the most (GPT-3.5 is most likely to intend unethical actions, while Llama-13b-chat is least likely). However, Llama-chat-{7b, 13b} typically chose unethical actions in both scenarios, in contrast to the Claude models and GPT-4, which were more evenly split.
Figure 5: Left: Estimated mean harmfulness (left) and truthfulness (right) scores for different context scores. The mean harmfulness scores of GPT-4 and GPT-3.5 are not influenced by the context, whereas davinci exhibits reflective harmfulness. Mean truthfulness is not influenced by the context for any model. Right: Estimated mean truthfulness for untruthful contexts of different lengths. GPT-4 is the only model whose truthfulness is influenced by longer contexts.
...and 5 more figures

Theorems & Definitions (8)

Definition 1: Character Trait Measure
Definition 2: Character
Definition 3: Intention
Definition 4: Interaction over time
Definition 5: Stationary Character Trait
Theorem 6
Proof: Proof Sketch
Definition 7: Reflective Character Trait
