FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Haotian Wu, Shufan Jiang, Mingyu Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Chengwei Qin

TL;DR

FURINA introduces FURINA-Builder, a fully customizable multi-agent framework that automatically constructs scalable role-playing (RP) benchmarks by coordinating a test character, a large character-scene pool, dialogue simulation, and a judge-driven selection mechanism. The resulting FURINA-Bench combines established and synthesized characters in group dialogues, enabling fine-grained evaluation across five dimensions and across languages. Key findings show that while established characters and reasoning capabilities boost RP performance, they also increase hallucinations, revealing a Pareto frontier between performance and reliability across models and settings. The framework demonstrates strong dimension-selection reliability and improved separability over baselines, offering a solid foundation for evaluating and advancing RP capabilities in LLMs. The work has practical implications for designing scalable, adaptable RP benchmarks and points to future directions in data curation and instruction-aware reasoning to mitigate hallucinations while preserving performance.

Abstract

As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, making it the first benchmark builder in the RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.

Paper Structure

This paper contains 67 sections, 9 equations, 12 figures, 23 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of FURINA-Builder. There are three components. (i) Character-scene pool: a data pool containing a large number of authentic dialogue scenarios. (ii) Simulation: the test character is placed into a scenario sampled from the pool and talks with the scene characters in it. (iii) Selection: for each test character turn, the pipeline queries responses from both the source and base models, and the judge model determines the evaluation dimension and selects the superior output. All marked items are customizable. Further explanation is given in the FURINA-Builder section.
  • Figure 2: FURINA-Bench Evaluation. For each test utterance, both the test model and the base model generate responses to the same prompt. Pairwise judgments with CoT analysis are then used to score the test response under the assigned evaluation dimension.
  • Figure 3: Role-playing hallucination rates (%) of Qwen3 for synthesized and established characters on the Chinese section. Reasoning leads to more severe hallucination.
  • Figure 4: Role-playing evaluation results across four models using GCA Evaluation and our FURINA-Bench Evaluation. Our method is more challenging and yields better separability.
  • Figure 5: Relationship between role-playing performance and reliability for Chinese established (left) and synthesized (right) characters. The reliability score is computed as 100 / (hallucination rate).
  • ...and 7 more figures
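
The selection step described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `select_turn`, `Verdict`, and `TestUtterance`, the callable signatures, and the "source"/"base" labels are all hypothetical, chosen only to mirror the caption's wording (both models answer the same turn, and a judge picks an evaluation dimension and the superior output).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a "model" maps a dialogue prompt to a response string.
Model = Callable[[str], str]


@dataclass
class Verdict:
    dimension: str  # evaluation dimension chosen by the judge (name is illustrative)
    winner: str     # "source" or "base": which response the judge prefers


# Hypothetical interface: a judge sees the prompt and both candidate responses.
Judge = Callable[[str, str, str], Verdict]


@dataclass
class TestUtterance:
    prompt: str
    dimension: str
    utterance: str
    origin: str  # which model produced the selected utterance


def select_turn(prompt: str, source_model: Model, base_model: Model,
                judge: Judge) -> TestUtterance:
    """Query both models for one test-character turn and keep the judge's pick."""
    source_resp = source_model(prompt)
    base_resp = base_model(prompt)
    verdict = judge(prompt, source_resp, base_resp)
    if verdict.winner not in ("source", "base"):
        raise ValueError(f"unexpected winner label: {verdict.winner!r}")
    chosen = source_resp if verdict.winner == "source" else base_resp
    return TestUtterance(prompt, verdict.dimension, chosen, verdict.winner)
```

In practice the three callables would wrap LLM calls; passing them in as plain functions keeps the sketch testable with stubs and reflects the caption's point that each component is customizable.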