Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models
Yin Jou Huang, Rafik Hadfi
TL;DR
This paper tackles biases in self-reported personality assessments of large language models (LLMs) by introducing a multi-observer informant-report framework. Subject and observer agents, placed in varied relationship contexts, co-create dialogue-based scenarios, after which the observers rate the subject on a 50-item IPIP-style questionnaire covering the Big Five traits ($OPE$, $CON$, $EXT$, $AGR$, $NEU$). Observer-reports align more closely with human judgments than self-reports and reveal systematic self-report biases in LLMs, especially for $AGR$ and $CON$. Aggregating ratings from 5–7 observers gives the best reliability and mitigates individual observer biases. Observer diversity and relationship type significantly shape how personality is perceived, so the framework offers a more robust, context-sensitive, and practical approach to evaluating LLM personalities in human-AI interactions.
Abstract
Self-report questionnaires have long been used to assess LLM personality traits, yet they fail to capture behavioral nuances due to biases and meta-knowledge contamination. This paper proposes a novel multi-observer framework for assessing personality traits in LLM agents that draws on informant-report methods in psychology. Instead of relying on self-assessments, we employ multiple observer agents. Each observer is configured with a specific relational context (e.g., family member, friend, or coworker) and engages the subject LLM in dialogue before evaluating its behavior across the Big Five dimensions. We show that these observer-report ratings align more closely with human judgments than traditional self-reports and reveal systematic biases in LLM self-assessments. We also find that aggregating responses from 5 to 7 observers reduces systematic biases and achieves optimal reliability. Our results highlight the role of relationship context in personality perception and demonstrate that a multi-observer paradigm offers a more reliable, context-sensitive approach to evaluating LLM personality traits.
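The aggregation step described above can be illustrated with a minimal sketch. It assumes each observer has already produced one score per Big Five trait and that aggregation is a simple mean across observers; the abstract does not specify the aggregation rule, so the averaging, the function name `aggregate_observer_reports`, and the example scores are all hypothetical.

```python
from statistics import mean

# Trait abbreviations as used in the summary above.
TRAITS = ("OPE", "CON", "EXT", "AGR", "NEU")


def aggregate_observer_reports(reports):
    """Average per-trait scores across observers.

    `reports` is a list of dicts, one per observer, mapping each trait
    to that observer's score. Mean aggregation is an assumption made
    for illustration, not necessarily the paper's procedure.
    """
    if not reports:
        raise ValueError("need at least one observer report")
    return {trait: mean(r[trait] for r in reports) for trait in TRAITS}


# Hypothetical scores from three observers with different relationship
# contexts (e.g., friend, coworker, family member); not real data.
reports = [
    {"OPE": 4.0, "CON": 3.0, "EXT": 2.0, "AGR": 5.0, "NEU": 1.0},
    {"OPE": 3.0, "CON": 4.0, "EXT": 2.0, "AGR": 4.0, "NEU": 2.0},
    {"OPE": 5.0, "CON": 5.0, "EXT": 2.0, "AGR": 3.0, "NEU": 3.0},
]

aggregated = aggregate_observer_reports(reports)
print(aggregated)
```

In this toy input the observers disagree most on $CON$ and $AGR$, and averaging pulls those scores toward a consensus value, which is the intuition behind using several observers rather than one.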
