The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents

Yihong Tang, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Bo Wang, Jie Liu, Min Zhang

TL;DR

This work tackles the safety-utility trade-off in role-playing dialogue agents by identifying villain-driven risk coupling as a key trigger for unsafe outputs. It introduces Adaptive Dynamic Multi-Preference (ADMP), which dynamically tunes safety and utility preferences based on real-time risk coupling, and Coupling Margin Sampling (CMS), which trains the model robustly on high-risk edge cases. Empirical results show that ADMP+CMS improves safety metrics with minimal loss of role-playing utility across open and closed LLMs, outperforming existing single- and multi-preference baselines. The approach offers a practical path toward safer, more expressive character simulations in narrative AI, with implications for scalable deployment and safety-focused evaluation in creative dialogue systems.

Abstract

Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety, because faithful character simulation often carries the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created jointly by villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety. We further introduce Coupling Margin Sampling (CMS) into coupling detection to enhance the model's ability to handle high-risk scenarios. Experimental results demonstrate that our approach improves safety metrics while maintaining utility.

Paper Structure

This paper contains 52 sections, 19 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: A role-playing game with the Joker.
  • Figure 2: (a) The distribution of safety and utility score proportions across different models. (b) Correlation heatmap between safety and utility metrics across various models. (c) Impact of villain character dialogues on normalized safety and utility metrics.
  • Figure 3: Overview of the ADMP framework: The model dynamically adjusts preferences and their corresponding weights based on contextual factors, rather than exhibiting a fixed bias towards either safety or utility, or prioritizing both. The CMS further enhances the model's ability to assess safety by sampling high-risk examples.
  • Figure 4: t-SNE visualization of hidden states of queries that generate safe and unsafe content.
  • Figure 5: Correlation between generated and actual (a) utility scores and (b) safety scores.
  • ...and 5 more figures