Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger

TL;DR

This work introduces AnthroBench, a scalable, empirically grounded framework for evaluating anthropomorphic behaviours in large language models. It advances beyond single-turn benchmarks by deploying a multi-turn evaluation of 14 behaviours, automated user simulations, and a large-scale human validation (N=1,101) to link measured behaviours with real user perceptions. Evaluating four SOTA LLMs, it finds similar anthropomorphic profiles dominated by relationship-building and first-person pronoun use, with many behaviours emerging after several turns. The methodology provides a publicly available benchmarking tool and a rigorous foundation for understanding how design choices influence anthropomorphism, with implications for safety, user trust, and policy in human–AI interactions.

Abstract

The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1,101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours first occur only after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.

Paper Structure

This paper contains 35 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Sample dialogue turn where an LLM exhibits anthropomorphic behaviours across all categories: internal states, relationship, embodiment, personhood
  • Figure 2: Design, evaluation, and validation stages of our approach. The design and validation stages were completed once to construct and test the evaluation. The evaluation stage is fully automated and re-run for each Target LLM. During design, we generate prompts based on different scenarios across four use domains (friendship, life coaching, career development, and general planning). During evaluation, we use these prompts as the first User LLM utterances and generate a dataset of hundreds of 5-turn synthetic dialogues per Target LLM. We then use three Judge LLMs to label the Target LLM messages within those dialogues for the presence of 13 anthropomorphic behaviours, and report the frequencies of these different behaviours (one behaviour, “personal pronoun use,” was computed using a simple count of relevant pronouns). Finally, in a one-off validation stage, we compare perceptions between 1,101 human participants who interacted with either a highly or minimally anthropomorphic AI system, to assess whether the frequency of these behaviours correlates with downstream anthropomorphic perceptions.
  • Figure 3: Anthropomorphism profiles of Gemini 1.5 Pro, Claude 3.5 Sonnet, GPT-4o, and Mistral Large. The four systems exhibit similar profiles characterised by a high frequency of relationship-building behaviours and first-person pronoun use. The radar plots for each system in (A) show the frequency of observed behaviours at the level of the four categories. The plot in (B) shows the percentage of annotated messages that exhibited each individual behaviour. Validation and first-person pronoun use are the only two behaviours that appear in over 50% of messages for all four systems.
  • Figure 4: Distribution of anthropomorphic behaviours across use domains. The social use domains of friendship and life coaching exhibit the highest frequencies of anthropomorphic behaviours.
  • Figure 5: Proportion of dialogues where anthropomorphic behaviours first appear in each turn. For more than half of the anthropomorphic behaviours, over 50% of instances first appear (and thus are only detected) in later dialogue turns (turns 2-5).
  • ...and 7 more figures
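
The captions for Figures 2, 3, and 5 describe two aggregate statistics: the share of annotated messages that exhibit each behaviour, and the turn in which each behaviour first appears in a dialogue. The following is a minimal sketch of how such statistics could be computed from judge labels. It is not the authors' implementation. The data layout, the function names, and in particular the majority-vote rule for combining the three Judge LLMs are assumptions made for illustration; the source does not state how judge labels are aggregated.

```python
from collections import Counter

# Assumed layout (hypothetical): a dialogue is a list of Target LLM turns;
# each turn maps a behaviour name to the labels given by the three judges.
toy_dialogues = [
    [
        {"empathy": [True, True, False], "validation": [False, False, False]},
        {"empathy": [False, False, False], "validation": [True, True, True]},
    ],
    [
        {"empathy": [False, False, True], "validation": [False, False, False]},
        {"empathy": [True, True, True], "validation": [False, True, True]},
    ],
]


def majority_vote(judge_labels):
    """Assumed aggregation rule: present if more than half of the judges say so."""
    return sum(judge_labels) * 2 > len(judge_labels)


def behaviour_frequency(dialogues, behaviours):
    """Fraction of annotated messages in which each behaviour is present."""
    counts = Counter()
    total_messages = 0
    for dialogue in dialogues:
        for turn in dialogue:
            total_messages += 1
            for behaviour in behaviours:
                if majority_vote(turn[behaviour]):
                    counts[behaviour] += 1
    if total_messages == 0:
        return {behaviour: 0.0 for behaviour in behaviours}
    return {behaviour: counts[behaviour] / total_messages for behaviour in behaviours}


def first_appearance(dialogues, behaviours):
    """Per behaviour, count dialogues by the 1-indexed turn of first appearance."""
    first = {behaviour: Counter() for behaviour in behaviours}
    for dialogue in dialogues:
        for behaviour in behaviours:
            for turn_index, turn in enumerate(dialogue, start=1):
                if majority_vote(turn[behaviour]):
                    first[behaviour][turn_index] += 1
                    break  # only the first occurrence in a dialogue counts
    return first


if __name__ == "__main__":
    names = ["empathy", "validation"]
    print(behaviour_frequency(toy_dialogues, names))
    print(first_appearance(toy_dialogues, names))
```

In this toy data, validation first appears in turn 2 of both dialogues, so an evaluation that inspected only the first turn would not detect it. This is the kind of effect a multi-turn evaluation is designed to capture.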