Can ChatGPT Assess Human Personalities? A General Evaluation Framework

Haocong Rao, Cyril Leung, Chunyan Miao

TL;DR

The paper investigates whether large language models (LLMs) can assess human personalities using a Myers–Briggs Type Indicator (MBTI) framework. It introduces a general evaluation pipeline with three components—Unbiased Prompt Design (random option permutations and averaged results), Subject-Replaced Query (analyzing groups rather than individuals), and Correctness-Evaluated Instruction (forcing a clear, correctness-based response)—and defines three metrics, $s_c$, $s_r$, and $s_f$, to quantify consistency, robustness to prompt biases, and fairness. Through experiments with InstructGPT, ChatGPT, and GPT-4 across diverse subjects, the study finds that ChatGPT and GPT-4 can perform human personality assessments with higher consistency and fairness than InstructGPT, though they are more sensitive to prompt biases. The framework provides a controlled approach to probing LLM psychology and offers guidance for safer, more human-friendly AI systems, while acknowledging limitations and ethical considerations in applying personality assessments to real-world populations.

Abstract

Large Language Models (LLMs), especially ChatGPT, have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored. Existing works study the virtual personalities of LLMs but rarely explore the possibility of analyzing human personalities via LLMs. This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers–Briggs Type Indicator (MBTI) tests. Specifically, we first devise unbiased prompts by randomly permuting the options in MBTI questions and adopt the average testing result to encourage more impartial answer generation. Then, we propose to replace the subject in question statements to enable flexible queries and assessments of different subjects by LLMs. Finally, we reformulate the question instructions as correctness evaluations to help LLMs generate clearer responses. The proposed framework enables LLMs to flexibly assess the personalities of different groups of people. We further propose three evaluation metrics to measure the consistency, robustness, and fairness of assessment results from state-of-the-art LLMs including ChatGPT and GPT-4. Our experiments reveal ChatGPT's ability to assess human personalities, and the average results demonstrate that it achieves more consistent and fairer assessments than InstructGPT, despite lower robustness against prompt biases.
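
The three prompt-construction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The option labels come from the Figure 3 caption (the paper's full option set may be larger). The instruction wording, the example statement, the numeric score mapping, and the helper names `replace_subject`, `build_prompt`, and `average_score` are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

# Option labels as listed in the Figure 3 caption; the full set may be larger.
OPTIONS = [
    "Generally correct",
    "Partially correct",
    "Neither correct nor wrong",
    "Partially wrong",
    "Generally wrong",
]

# Hypothetical numeric scores, used only to average answers across runs.
OPTION_SCORES = dict(zip(OPTIONS, [2, 1, 0, -1, -2]))


def replace_subject(statement: str, subject: str) -> str:
    """Subject-replaced query: swap a leading 'You' for the queried subject."""
    if statement.startswith("You "):
        return subject + statement[len("You"):]
    return statement


def build_prompt(statement: str, subject: str,
                 rng: random.Random) -> Tuple[str, List[str]]:
    """Build one prompt with a randomly permuted option order."""
    options = OPTIONS[:]
    rng.shuffle(options)  # unbiased prompt design: random option order
    lines = [
        # Correctness-evaluated instruction (wording is an assumption).
        "Evaluate whether the following statement is correct.",
        f"Statement: {replace_subject(statement, subject)}",
        "Options:",
    ]
    lines += [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines), options


def average_score(answers: List[str]) -> float:
    """Average the (hypothetical) scores of answers from independent runs."""
    return sum(OPTION_SCORES[a] for a in answers) / len(answers)


if __name__ == "__main__":
    rng = random.Random(0)
    prompt, _ = build_prompt("You regularly make new friends.", "Artists", rng)
    print(prompt)
```

In this sketch, each independent run would call `build_prompt` with a fresh permutation, send the prompt to the model, and map the returned option back to a score. Averaging over runs is what reduces the influence of any single option order.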

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of our framework: (a) The queried subject is replaced in the original statements of MBTI questions; (b) We construct correctness-evaluated instructions and (c) randomly permute options to build unbiased prompts with the subject-replaced statements (d), which are assessed by LLMs to infer the personality.
  • Figure 2: Comparison of answers generated by ChatGPT when adopting different types of instructions. Note that the agreement-measured instruction always leads to a neutral answer in practice.
  • Figure 3: The most frequent option for each question across multiple independent test runs of InstructGPT (Left), ChatGPT (Middle), and GPT-4 (Right) when we query the subject “People” (Top row) or “Artists” (Bottom row). “GC”, “PC”, “NCNW”, “PW”, and “GW” denote “Generally correct”, “Partially correct”, “Neither correct nor wrong”, “Partially wrong”, and “Generally wrong”.
  • Figure 4: The most frequent option for each question across multiple independent test runs of InstructGPT (Left), ChatGPT (Middle), and GPT-4 (Right) when we query the subject “Artists” without using unbiased prompts. “W” denotes “Wrong”, and the other legends are the same as in Fig. 3.
  • Figure 5: Personality scores of different subjects in five dimensions of MBTI results assessed from InstructGPT (Blue), ChatGPT (Orange), and GPT-4 (Green).
  • ...and 1 more figure