The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Yihong Tang, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Bo Wang, Jie Liu, Min Zhang
TL;DR
This work tackles the safety-utility trade-off in role-playing dialogue agents by identifying villain-driven risk coupling as a key trigger for unsafe outputs. It introduces Adaptive Dynamic Multi-Preference (ADMP), which dynamically tunes safety and utility preferences based on real-time risk coupling, and Coupling Margin Sampling (CMS) to robustly train on high-risk edge cases. Empirical results show that ADMP+CMS improves safety metrics with minimal loss to role-playing utility across open and closed LLMs, outperforming existing single- and multi-preference baselines. The approach offers a practical path toward safer, more expressive character simulations in narrative AI, with implications for scalable deployment and safety-focused evaluation in creative dialogue systems.
Abstract
Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulation. However, it remains challenging for these agents to balance character-portrayal utility with content safety, because the very character simulation that makes them useful often carries the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created by the combination of villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety accordingly. We further introduce Coupling Margin Sampling (CMS) into coupling detection to enhance the model's ability to handle high-risk scenarios. Experimental results demonstrate that our approach improves safety metrics while maintaining utility.
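
To make the idea of coupling-dependent preferences concrete, the following is a minimal illustrative sketch, not the ADMP method itself. The abstract does not specify how risk coupling is measured or how preferences are adjusted, so everything here is an assumption: the names `coupling_score`, `adaptive_weights`, and `PreferenceWeights`, the product form of the coupling score, the linear mapping to weights, and the `floor` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceWeights:
    """Relative emphasis on safety vs. utility; the two weights sum to 1."""
    safety: float
    utility: float


def coupling_score(character_risk: float, query_risk: float) -> float:
    """Hypothetical coupling measure (an assumption, not from the paper).

    Both inputs are risk estimates in [0, 1]. Using their product means the
    score is high only when a risky character meets a risky query.
    """
    for name, value in (("character_risk", character_risk),
                        ("query_risk", query_risk)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return character_risk * query_risk


def adaptive_weights(coupling: float, floor: float = 0.1) -> PreferenceWeights:
    """Map a coupling score in [0, 1] to safety/utility weights.

    Assumed linear rule: the safety weight grows with coupling, and `floor`
    keeps either preference from vanishing entirely.
    """
    if not 0.0 <= coupling <= 1.0:
        raise ValueError(f"coupling must be in [0, 1], got {coupling}")
    if not 0.0 <= floor < 0.5:
        raise ValueError(f"floor must be in [0, 0.5), got {floor}")
    safety = floor + (1.0 - 2.0 * floor) * coupling
    return PreferenceWeights(safety=safety, utility=1.0 - safety)


if __name__ == "__main__":
    # A villain character paired with a benign query vs. a risky query.
    benign = adaptive_weights(coupling_score(0.9, 0.1))
    risky = adaptive_weights(coupling_score(0.9, 0.9))
    print("benign query:", benign)
    print("risky query: ", risky)
```

Under these assumptions, a villain character alone does not push the weights toward safety; only the combination of a risky character and a risky query does, which mirrors the abstract's description of risk coupling as arising from both sources together.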
