SocialBench: Sociality Evaluation of Role-Playing Conversational Agents
Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, Jingren Zhou
TL;DR
SocialBench addresses the gap in evaluating the social intelligence of role-playing conversational agents, extending beyond character fidelity to assess social interaction at both the individual and group levels. It introduces a large, multi-source dataset (500 characters, over 6,000 questions, 30,800 utterances) built with a three-stage construction pipeline (profile collection, dialogue construction, question design) and rigorous pre- and post-validation. The study evaluates diverse open- and closed-source LLMs and finds that group-level sociality is more challenging, that performance on individual-level tasks does not predict group-level performance, and that memory and group dynamics pose notable limitations. The benchmark provides a publicly available testbed and analytical insights to spur future work on social intelligence for role-playing agents and their deployment in multi-agent contexts.
Abstract
Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce SocialBench, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels of social interaction. The benchmark is constructed from a variety of sources and covers 500 characters, over 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that an agent's strong performance at the individual level does not imply proficiency at the group level. Moreover, the behavior of individual agents may drift under the influence of other agents within the group. Experimental results on SocialBench confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/SocialBench.
