Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs

Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, Xinyang Han, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu

TL;DR

This paper addresses safety risks in role-play fine-tuning of LLMs. It quantifies safety degradation across 95 role-specific models and introduces SaRFT, a two-stage method that adaptively selects "harmful" data for each role (RDS) and then balances role-play against safety during optimization (RBO). Grounded in an implicit-reward view of LLM alignment, SaRFT uses role prompts and unsafe prompts to compute role and safety scores, which drive role-adaptive data selection and constrained optimization. Across three backbones and under both LoRA and full-parameter fine-tuning, SaRFT consistently outperforms baselines on role-play benchmarks while preserving safety, achieving Pareto-optimal trade-offs and showing resilience to jailbreak attacks. The findings underscore the need for role-adaptive safeguards and offer practical guidance for safer, more reliable role-playing LLMs; future work extends to larger models and multimodal settings.

Abstract

Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.

Paper Structure

This paper contains 45 sections, 9 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: An example showing the trade-off between role-playing enhancement and safety preservation.
  • Figure 2: An overview of our proposed SaRFT framework. In RDS, we dynamically identify "harmful" data for different roles based on role-specific influences, ensuring a role-adaptive data selection. In RBO, we employ a dual-objective optimization strategy that enhances role-playing performance while preserving safety, effectively mitigating conflicts between expressiveness and robustness in role-play fine-tuning.
  • Figure 3: The bar chart shows the Refusal Rates (R.R.) on harmful inputs for different role-playing LLMs after SFT. The line plot shows the proportion of "harmful" data selected by RDS for each role. Characters marked with red circles tend to have more negative or antagonistic personalities, while those marked with green circles exhibit more positive or neutral traits.
  • Figure 4: Data inspection for selected "harmful" (red background) and harmless responses (green background) from two distinct AI personas.
  • Figure 5: Pareto front comparison of role-playing and safety benchmarks for SaRFT and baselines applied to LLaMA-3-8B-Instruct under full-parameter fine-tuning settings. Each point represents a different method, with the x-axis indicating safety (higher is better) and the y-axis indicating RoleBench performance (higher is better).
  • ...and 8 more figures
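
The Pareto front comparison described for Figure 5 can be illustrated with a minimal sketch. It assumes each method is summarized by a (safety, role-play) score pair where higher is better on both axes, as in the caption. The function name `pareto_front`, the method names, and all scores below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of methods that no other method dominates.

    Each value is a (safety, role_play) pair; higher is better on both axes.
    A method is dominated if another method is at least as good on both
    axes and strictly better on at least one.
    """
    front = []
    for name, (safety, role) in points.items():
        dominated = any(
            (s2 >= safety and r2 >= role) and (s2 > safety or r2 > role)
            for other, (s2, r2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical scores for illustration only, not results from the paper.
example = {
    "method_a": (0.90, 0.60),
    "method_b": (0.70, 0.80),
    "method_c": (0.65, 0.75),  # worse than method_b on both axes
    "method_d": (0.95, 0.40),
}

front = pareto_front(example)
print(front)
```

In this toy input, `method_c` is dropped because `method_b` is better on both axes; the remaining methods each win on at least one axis against every other method, so they form the front.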