Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger
TL;DR
This work introduces AnthroBench, a scalable, empirically grounded framework for evaluating anthropomorphic behaviours in large language models. It goes beyond single-turn benchmarks by combining a multi-turn evaluation of 14 behaviours, automated user simulations, and a large-scale human validation study (N=1,101) that links the measured behaviours to real users' perceptions. The four SOTA LLMs evaluated show similar anthropomorphic profiles, dominated by relationship-building and first-person pronoun use, and many behaviours emerge only after several turns. The methodology provides a publicly available benchmarking tool and a rigorous foundation for understanding how design choices influence anthropomorphism, with implications for safety, user trust, and policy in human-AI interaction.
Abstract
The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1,101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that most behaviours first occur only after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for advancing the ethical debate on the desirability of these behaviours. It also demonstrates the need for multi-turn evaluations of complex social phenomena in human-AI interaction.
