BaiJia: A Large-Scale Role-Playing Agent Corpus of Chinese Historical Characters

Ting Bai, Jiazheng Kang, Jiayang Fan

TL;DR

This work tackles data scarcity in historical character role-playing (RP) by introducing BaiJia, a large-scale corpus of low-resource data on Chinese historical figures spanning five dynasties. A three-part dataset pipeline (resume collection, dialogue generation, and question construction) produces multi-modal, knowledge-rich profiles for supervised fine-tuning (SFT) and RP with LLMs, covering 19,281 characters and 15 resume categories. An evaluation benchmark across six RP dimensions (plus three new ones for depth) shows that incorporating BaiJia data significantly improves performance across diverse models, including both general-purpose and RP-focused LLMs; ablation and case studies further validate the utility of both the resumes and the generated dialogues. Overall, BaiJia provides a foundational resource for low-resource historical AI, enabling more coherent, culturally aware, and historically grounded interactions with Chinese historical figures and supporting future research on historical knowledge-grounded RP with LLMs.

Abstract

We introduce BaiJia, a comprehensive large-scale role-playing agent corpus of Chinese historical characters. It is the first compilation of low-resource data that large language models (LLMs) can use to power AI-driven historical role-playing agents. BaiJia addresses the challenge of historical textual records being fragmented across different forms and modalities by integrating information about each character, including biographical details, literary works, family relations, and historical events, among others. We conduct extensive experiments demonstrating that the BaiJia agent corpus strengthens the role-playing abilities of various foundation LLMs and supports the development and assessment of LLMs on historical role-playing tasks. The agent corpus is available at baijia.online.

Paper Structure (14 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The pipeline of role-playing agent construction.
  • Figure 2: Comparison of responses from different LLMs to a question posed to the character Bai Ben. Answers that are correct according to his resume are highlighted in green; fabricated or false answers are highlighted in red.
  • Figure 3: Radar chart showing the performance of the fully optimized LLM ("Ours") and its variants across six evaluation dimensions.