Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

Yin Jou Huang, Rafik Hadfi

TL;DR

This paper tackles biases in self-reported personality assessments of large language models (LLMs) by introducing a multi-observer informant-report framework. It deploys subject and observer agents with varied relationship contexts to co-create dialogue-based scenarios, which are subsequently rated via a 50-item IPIP-style questionnaire covering the Big Five traits ($OPE$, $CON$, $EXT$, $AGR$, $NEU$). Key findings show that observer-reports align more closely with human judgments than self-reports and reveal systematic self-report biases in LLMs, especially for $AGR$ and $CON$; aggregating 5–7 observers optimizes reliability and mitigates individual biases. The framework emphasizes context sensitivity and demonstrates that observer diversity and relationship type significantly shape personality perception, offering a more robust and practical approach for evaluating LLM personalities in human-AI interactions.

Abstract

Self-report questionnaires have long been used to assess LLM personality traits, yet they fail to capture behavioral nuances due to biases and meta-knowledge contamination. This paper proposes a novel multi-observer framework for personality trait assessments in LLM agents that draws on informant-report methods in psychology. Instead of relying on self-assessments, we employ multiple observer agents. Each observer is configured with a specific relational context (e.g., family member, friend, or coworker) and engages the subject LLM in dialogue before evaluating its behavior across the Big Five dimensions. We show that these observer-report ratings align more closely with human judgments than traditional self-reports and reveal systematic biases in LLM self-assessments. We also found that aggregating responses from 5 to 7 observers reduces systematic biases and achieves optimal reliability. Our results highlight the role of relationship context in perceiving personality and demonstrate that a multi-observer paradigm offers a more reliable, context-sensitive approach to evaluating LLM personality traits.
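As a rough illustration of the aggregation step described in the abstract, the sketch below scores questionnaire responses per Big Five trait and averages the trait scores across observers. The 1-to-5 Likert scale, the block layout of 10 items per trait, the reverse-keyed item indices, and the function names `score_report` and `aggregate_observers` are all assumptions made for this example. The paper's actual scoring procedure may differ.

```python
from statistics import mean

TRAITS = ("OPE", "CON", "EXT", "AGR", "NEU")


def score_report(responses, item_traits, reverse_keyed, scale_max=5):
    """Average Likert responses per trait, flipping reverse-keyed items.

    responses:     one integer answer per questionnaire item (1..scale_max)
    item_traits:   trait label for each item, same length as responses
    reverse_keyed: set of item indices whose scale is inverted (assumed)
    """
    per_trait = {trait: [] for trait in TRAITS}
    for idx, (answer, trait) in enumerate(zip(responses, item_traits)):
        value = (scale_max + 1 - answer) if idx in reverse_keyed else answer
        per_trait[trait].append(value)
    return {trait: mean(values) for trait, values in per_trait.items()}


def aggregate_observers(reports):
    """Mean of each trait score across a list of per-observer reports."""
    return {trait: mean(report[trait] for report in reports) for trait in TRAITS}


# Toy data: 50 items laid out as 10 consecutive items per trait (hypothetical).
item_traits = [trait for trait in TRAITS for _ in range(10)]
reverse_keyed = {1, 3}  # hypothetical reverse-keyed items, both in the OPE block

observer_a = [4] * 50  # one observer's answers about the subject agent
observer_b = [2] * 50  # a second observer's answers about the same subject

reports = [
    score_report(answers, item_traits, reverse_keyed)
    for answers in (observer_a, observer_b)
]
aggregated = aggregate_observers(reports)
print(aggregated)
```

Averaging is only one possible aggregation rule. It is used here because the mean of several independent ratings dampens the idiosyncratic bias of any single observer, which is the effect the abstract attributes to using 5 to 7 observers.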

Paper Structure

This paper contains 41 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview: multi-observer LLM agents for Big Five personality assessment.
  • Figure 2: Observer report ratings across different latent personality strength levels.
  • Figure 3: Spearman's Rank Correlation coefficients between latent-observer and self-observer ratings as a function of the number of observers for each Big Five trait.
  • Figure 4: Mean differences between observer and self-reports across Big Five personality traits by relationship context. The orange line represents the median, while the green dotted line shows the mean. Relationships with statistically significant differences ($p < 0.05$) are marked with asterisks ($*$).
  • Figure 5: Difference between observer-reports and self-reports in each Big Five personality dimension for different models and prompt variations. Asterisks indicate differences that are statistically significant (*: p < 0.05, **: p < 0.1).
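The kind of analysis summarized in the Figure 3 caption, a Spearman rank correlation tracked as the number of observers grows, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the noise model, the reference scores, and the helper names (`ranks`, `spearman`, `correlation_by_observer_count`) are assumptions for this example, and Spearman's coefficient is computed here as the Pearson correlation of tie-averaged ranks.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_by_observer_count(observer_ratings, reference, max_k):
    """Correlate the mean of the first k observers with a reference score.

    observer_ratings[s][o] is the rating of subject s by observer o.
    """
    result = {}
    for k in range(1, max_k + 1):
        aggregated = [mean(row[:k]) for row in observer_ratings]
        result[k] = spearman(aggregated, reference)
    return result


# Synthetic data (assumption): 20 subjects, 7 observers, Gaussian rating noise.
rng = random.Random(0)
reference = [rng.uniform(1, 5) for _ in range(20)]
observer_ratings = [
    [score + rng.gauss(0, 1.0) for _ in range(7)] for score in reference
]

curve = correlation_by_observer_count(observer_ratings, reference, max_k=7)
for k, rho in curve.items():
    print(k, round(rho, 3))
```

In this sketch the `reference` list stands in for whichever score the observer aggregate is compared against, such as the latent trait level assigned to the subject or its self-report rating.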