Table of Contents
Fetching ...

CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, Yanghua Xiao

TL;DR

CoSER tackles the core challenges of authentic data scarcity and evaluation bias in role-playing LLMs for established characters by delivering a large, multi-type dataset derived from 771 books and introducing the given-circumstance acting framework for training and evaluation. It trains open models CoSER 8B and 70B on a broad, diverse corpus and demonstrates state-of-the-art performance on multiple benchmarks, including human evaluation, through a penalty-based, rubric-driven GCA protocol. The work also demonstrates the value of retrieval augmentation and inner thoughts in improving fidelity and control, and provides extensive analyses, ablation studies, and case studies to validate the approach. The authors plan to release the dataset, models, and evaluation tools to support further research while addressing copyright and ethical considerations.

Abstract

Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.

CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

TL;DR

CoSER tackles the core challenges of authentic data scarcity and evaluation bias in role-playing LLMs for established characters by delivering a large, multi-type dataset derived from 771 books and introducing the given-circumstance acting framework for training and evaluation. It trains open models CoSER 8B and 70B on a broad, diverse corpus and demonstrates state-of-the-art performance on multiple benchmarks, including human evaluation, through a penalty-based, rubric-driven GCA protocol. The work also demonstrates the value of retrieval augmentation and inner thoughts in improving fidelity and control, and provides extensive analyses, ablation studies, and case studies to validate the approach. The authors plan to release the dataset, models, and evaluation tools to support further research while addressing copyright and ethical considerations.

Abstract

Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.

Paper Structure

This paper contains 54 sections, 6 figures, 30 tables.

Figures (6)

  • Figure 1: An example from CoSER dataset, which provides comprehensive data types such as conversation dialogues and settings, plot summaries, characters' inner thoughts, authentically sourced from renowned books.
  • Figure 2: Overview of CoSER's dataset, training and evaluation. Left: The CoSER dataset is sourced from renowned books and processed via LLM-based pipeline. It contains rich data types on plots, conversations and characters. Right: We apply given-circumstance acting to train and evaluate role-playing LLMs using these conversations. For training, each sample trains the LLM to portray a specific character in a conversation, using their original dialogue. For evaluation, we build a multi-agent system for conversation simulation given the same scenario, and assess the simulated dialogue via penalty-based LLM critics.
  • Figure 3: We divide the RPLA evaluation into four quadrants (dimensions). The X-axis represents the evaluation perspective: Referenced (comparing with data in the source book) versus Inherent (assessing standalone quality). The Y-axis indicates the evaluation scope: Individual Agent versus Overall Simulation.
  • Figure 4: LLM Performance on CoSER Test with retrieval augmentation from various character data. Expr. and Conv. denote experiences and conversations respectively.
  • Figure 5: Genre distribution of selected books in CoSER dataset.
  • ...and 1 more figures