Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs

Weixiang Zhao; Yulin Hu; Yang Deng; Jiahe Guo; Xingyu Sui; Xinyang Han; An Zhang; Yanyan Zhao; Bing Qin; Tat-Seng Chua; Ting Liu

TL;DR

This paper addresses safety risks in role-play fine-tuning of LLMs. It quantifies safety degradation across 95 role-specific models and introduces SaRFT, a two-stage method that adaptively selects harmful data (RDS) and balances role-play with safety (RBO). Grounded in an implicit-reward view of LLM alignment, SaRFT uses role prompts and unsafe prompts to compute role and safety scores, which drive role-adaptive data selection and constrained optimization. Across three backbones, under both LoRA and full fine-tuning, SaRFT consistently outperforms baselines on role-play benchmarks while preserving safety, achieving Pareto-optimal trade-offs and showing resilience to jailbreak attacks. The findings underscore the need for role-adaptive safeguards and offer practical guidance for safer, more reliable role-playing LLMs, with extensions to larger models and multimodal settings left to future work.

Abstract

Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.
