SocialBench: Sociality Evaluation of Role-Playing Conversational Agents

Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, Jingren Zhou

TL;DR

SocialBench addresses a gap in evaluating the social intelligence of role-playing conversational agents: it extends evaluation beyond character fidelity to social interaction at both the individual and group levels. It introduces a multi-source dataset (500 characters, over 6,000 question prompts, 30,800 multi-turn utterances) built with a three-stage construction pipeline (profile collection, dialogue construction, question design) that includes pre- and post-validation. The study evaluates a range of open- and closed-source LLMs and finds that group-level sociality is more challenging, that strong performance on individual-level tasks does not predict group-level performance, and that memory and group dynamics remain notable limitations. The benchmark is publicly available as a testbed, with analysis intended to spur future work on social intelligence for role-playing agents and their deployment in multi-agent settings.

Abstract

Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce SocialBench, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels of social interaction. The benchmark is constructed from a variety of sources and covers 500 characters, over 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents that excel at the individual level are not necessarily proficient at the group level. Moreover, the behavior of individuals may drift as a result of the influence exerted by other agents within the group. Experimental results on SocialBench confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/SocialBench.

Paper Structure (35 sections, 4 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: An example from SocialBench, which is partially constructed from the film "The Great Gatsby".
  • Figure 2: The three-step dataset construction pipeline of SocialBench.
  • Figure 3: Personality traits distribution in SocialBench.
  • Figure 4: Distribution of dialogue tokens across four dimensions in SocialBench, based on the Qwen tokenizer.
  • Figure 5: Performance with respect to the number of utterances.
  • ...and 7 more figures