Reasoning Does Not Necessarily Improve Role-Playing Ability

Xiachong Feng, Longxu Dou, Lingpeng Kong

TL;DR

The paper asks whether reasoning techniques improve the role-playing abilities of large language models, using a large-scale, standardized evaluation across six benchmarks and 24 LLMs with three role-playing strategies: direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing with reasoning-optimized LLMs. It finds that CoT can reduce performance, that reasoning-optimized models are ill-suited to role-playing, and that reasoning disrupts the role-playing scaling law, while the Qwen series excels and Chinese role-playing often surpasses English. The authors propose two future directions, role-aware CoT and reinforcement learning for role-playing, to improve persona consistency and adaptive behavior, and they deliver a standardized OpenCompass framework to enable reproducible research. The findings offer practical guidance for deploying role-playing LLMs and inform how reasoning might be integrated with character-driven AI systems.

Abstract

The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains, driving an increasing demand for high-precision role-playing models. Simultaneously, the rapid advancement of reasoning techniques has continuously pushed the performance boundaries of LLMs. This intersection of practical role-playing demands and evolving reasoning capabilities raises an important research question: "Can reasoning techniques enhance the role-playing capabilities of LLMs?" To address this, we conduct a comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies, comparing the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, large models still lack proficiency in advanced role-playing, and Chinese role-playing performance surpasses English role-playing performance. Furthermore, based on extensive experimental results, we propose two promising future research directions: Role-aware CoT for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance the adaptability, consistency, and effectiveness of role-playing LLMs for both research and real-world applications.
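
As a rough illustration of the two prompting-based strategies named in the abstract (direct zero-shot role-playing and role-playing with CoT), the sketch below builds both prompts from the same test case. The prompt wording, the `RolePlayCase` type, and the `build_prompt` helper are hypothetical and are not taken from the paper or from OpenCompass. The third strategy, a reasoning-optimized LLM, would presumably reuse the direct prompt with a different model, so it needs no separate template here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolePlayCase:
    """One role-playing test item (hypothetical schema, for illustration only)."""
    character: str
    profile: str
    user_turn: str


def build_prompt(case: RolePlayCase, strategy: str) -> str:
    """Return a prompt for the given strategy: 'zero_shot' or 'cot'.

    The wording is an assumption made for this sketch, not the paper's template.
    """
    base = (
        f"You are role-playing as {case.character}.\n"
        f"Character profile: {case.profile}\n"
        f"User: {case.user_turn}\n"
    )
    answer_cue = f"{case.character}:"
    if strategy == "zero_shot":
        # Direct role-playing: the model answers in character immediately.
        return base + answer_cue
    if strategy == "cot":
        # CoT role-playing: the model is asked to reason before answering.
        cot_cue = "Think step by step about how the character would respond, then answer in character.\n"
        return base + cot_cue + answer_cue
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    example = RolePlayCase(
        character="Sherlock Holmes",
        profile="A brilliant, observant consulting detective.",
        user_turn="Who took the jewels?",
    )
    for name in ("zero_shot", "cot"):
        print(f"--- {name} ---")
        print(build_prompt(example, name))
```

Keeping the character context identical across strategies means any score difference between the two prompts can be attributed to the added reasoning instruction rather than to a change in persona information.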

Paper Structure

This paper contains 24 sections, 4 figures, 13 tables.

Figures (4)

  • Figure 1: Performance comparison of 17 models using two role-playing methods across six benchmarks. The horizontal axis ranks models in descending order of scale, while the vertical axis represents the unique metric for each dataset. Darker colors indicate that zero-shot role-playing outperforms CoT role-playing, whereas lighter colors indicate that zero-shot role-playing underperforms CoT role-playing.
  • Figure 2: Experimental results of various models across 6 benchmarks. Models of similar sizes are represented using the same color scheme, with each employing different types of reasoning techniques. The vertical axis denotes the evaluation metrics specific to each dataset.
  • Figure 3: Performance comparison of different models across six benchmarks. The horizontal axis represents model size, arranged from smallest to largest, while the vertical axis denotes benchmark-specific evaluation metrics, where higher values indicate better role-playing performance. Within each benchmark, different color gradients represent the performance curves for its respective sub-datasets.
  • Figure 4: Fine-grained performance of the Qwen2.5 series on the CharacterEval benchmark. The radar chart illustrates multiple evaluation dimensions, with metrics computed using a pretrained reward model. Higher scores indicate stronger capabilities.