
SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Haochang Hao, Yifan Xu, Xinzhuo Li, Yingqiang Ge, Lu Cheng

TL;DR

SafeCRS is a safety-aware training framework that integrates Safe Supervised Fine-Tuning (Safe-SFT) with Safe Group reward-Decoupled Normalization Policy Optimization (Safe-GDPO) to jointly optimize recommendation quality and personalized safety alignment.

Abstract

Current LLM-based conversational recommender systems (CRS) primarily optimize recommendation accuracy and user satisfaction. We identify an underexplored vulnerability: recommendations can harm users by violating personalized safety constraints when individualized safety sensitivities, such as trauma triggers, self-harm history, or phobias, are implicitly inferred from the conversation but not respected during recommendation. We formalize this challenge as personalized CRS safety and introduce SafeRec, a new benchmark dataset designed to systematically evaluate safety risks in LLM-based CRS under user-specific constraints. To address this problem, we propose SafeCRS, a safety-aware training framework that integrates Safe Supervised Fine-Tuning (Safe-SFT) with Safe Group reward-Decoupled Normalization Policy Optimization (Safe-GDPO) to jointly optimize recommendation quality and personalized safety alignment. Extensive experiments on SafeRec show that SafeCRS reduces safety violation rates by up to 96.5% relative to the strongest recommendation-quality baseline while maintaining competitive recommendation quality. Warning: This paper contains potentially harmful and offensive content.
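The name Safe-GDPO suggests that each reward signal (for example relevance and safety) is normalized within the group of sampled completions before the signals are combined, instead of normalizing their weighted sum as in GRPO-style training. The sketch below assumes exactly that reading; the function names, the z-score normalization, and the equal default weights are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt
from typing import Dict, List, Optional


def group_normalize(values: List[float], eps: float = 1e-6) -> List[float]:
    """Z-score one reward component across the completions sampled for a prompt."""
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


def coupled_advantages(
    rewards: Dict[str, List[float]], weights: Optional[Dict[str, float]] = None
) -> List[float]:
    """GRPO-style baseline: sum the weighted rewards first, then normalize the sum."""
    weights = weights or {name: 1.0 for name in rewards}
    n = len(next(iter(rewards.values())))
    totals = [sum(weights[name] * rewards[name][i] for name in rewards) for i in range(n)]
    return group_normalize(totals)


def decoupled_advantages(
    rewards: Dict[str, List[float]], weights: Optional[Dict[str, float]] = None
) -> List[float]:
    """Assumed decoupled variant: normalize each reward separately, then combine."""
    weights = weights or {name: 1.0 for name in rewards}
    n = len(next(iter(rewards.values())))
    combined = [0.0] * n
    for name, values in rewards.items():
        for i, adv in enumerate(group_normalize(values)):
            combined[i] += weights[name] * adv
    return combined


if __name__ == "__main__":
    # Four sampled completions; completion 1 is relevant but gets a small safety penalty.
    group = {"relevance": [10.0, 8.0, 2.0, 0.0], "safety": [0.0, -1.0, 0.0, 0.0]}
    # Completion 1 ranks second under coupled normalization but last under decoupled.
    print("coupled:  ", coupled_advantages(group))
    print("decoupled:", decoupled_advantages(group))
```

In this toy group the safety penalty is small compared with the spread of the relevance reward, so summing first lets relevance swamp it; normalizing each component separately keeps the safety signal on the same scale as relevance.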

Paper Structure

This paper contains 61 sections, 15 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Example of a personally unsafe recommendation. For a child afraid of firearms and violence, Resident Evil satisfies the recommendation requirements but may violate the user's safety concerns; Coraline is a safer and better recommendation.
  • Figure 2: Overview of the SafeRec benchmark generation pipeline. We construct a ground-truth dataset for safety evaluation by combining the safety components of SafeMovie and SafeGame with conversations. The pipeline fuses domain-specific safety descriptors with user sensitivity traits extracted from Reddit and uses a continuous risk scoring mechanism to quantify recommendation safety.
  • Figure 3: Two-stage training pipeline. Stage 1 (Safe-SFT) trains the model to produce a safety reasoning block that identifies and filters unsafe items, followed by a safe recommendation list. Stage 2 (Safe-GDPO) samples multiple ranked completions and applies per-rank reward decomposition, combining relevance rewards with safety penalties from a plug-in Safety Oracle, to update the policy via group-normalized advantages.
  • Figure 4: Safety-relevance trade-off across all methods on SafeRec.
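The per-rank reward decomposition described for Stage 2 (Figure 3) can be illustrated with a minimal sketch. It assumes a DCG-style 1/log2(rank + 1) position discount, a Safety Oracle exposed as a callable that returns a risk score in [0, 1], and a fixed risk threshold and penalty weight. All of these, including `per_rank_rewards`, `toy_oracle`, the risk values, and the placeholder third item "Up", are hypothetical choices for illustration; only the Resident Evil and Coraline example comes from Figure 1.

```python
from math import log2
from typing import Callable, List, Set, Tuple

# Hypothetical plug-in Safety Oracle interface: (item, user_profile) -> risk in [0, 1].
SafetyOracle = Callable[[str, str], float]


def per_rank_rewards(
    ranked_items: List[str],
    relevant: Set[str],
    user_profile: str,
    oracle: SafetyOracle,
    penalty: float = 1.0,
    risk_threshold: float = 0.5,
) -> Tuple[List[float], List[float]]:
    """Split one ranked completion into per-rank relevance and safety terms."""
    relevance_terms: List[float] = []
    safety_terms: List[float] = []
    for rank, item in enumerate(ranked_items, start=1):
        discount = 1.0 / log2(rank + 1)  # assumed DCG-style position discount
        relevance_terms.append(discount if item in relevant else 0.0)
        # Penalize, with the same discount, any item the oracle flags as risky for this user.
        risky = oracle(item, user_profile) >= risk_threshold
        safety_terms.append(-penalty * discount if risky else 0.0)
    return relevance_terms, safety_terms


# Toy oracle with hand-set risk scores (illustrative values, not from SafeRec).
TOY_RISK = {"Resident Evil": 0.9, "Coraline": 0.1}


def toy_oracle(item: str, user_profile: str) -> float:
    return TOY_RISK.get(item, 0.0)


if __name__ == "__main__":
    rel, saf = per_rank_rewards(
        ["Resident Evil", "Coraline", "Up"],
        relevant={"Resident Evil", "Coraline"},
        user_profile="child afraid of firearms and violence",
        oracle=toy_oracle,
    )
    print("relevance per rank:", rel)
    print("safety per rank:", saf)
    print("total reward:", sum(rel) + sum(saf))
```

Keeping the relevance and safety terms as separate per-rank lists, rather than collapsing them into one scalar immediately, lets a trainer normalize each component across the sampled group before forming advantages.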