Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach

Yuchen Wu, Edward Sun, Kaijie Zhu, Jianxun Lian, Jose Hernandez-Orallo, Aylin Caliskan, Jindong Wang

TL;DR

The work tackles the mismatch between context-free safety evaluations and real-world risks by formalizing personalized safety for LLMs. It introduces PENGUIN, the first large-scale benchmark with 14,000 scenarios, each in context-rich and context-free variants, across seven high-stakes domains, and shows that access to user context improves safety scores by about 43.2%. To realize practical personalized safety without model retraining, the authors propose RAISE, a training-free two-stage agent that plans offline with LLM-guided MCTS and executes online with a dual-module system for selective context acquisition and abstention, achieving up to 31.6% safety improvement with an average of 2.7 user queries. These results show that selective information gathering is crucial for safe LLM deployment in high-risk settings and offer a scalable path to user-context-aware safety alignment.

Abstract

Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt, posing serious safety risks in high-stakes applications where user vulnerabilities differ widely. Existing safety evaluations primarily rely on context-independent metrics (such as factuality, bias, or toxicity), overlooking the fact that the same response may carry divergent risks depending on the user's background or condition. We introduce personalized safety to fill this gap and present PENGUIN, a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. Evaluating six leading LLMs, we demonstrate that personalized user information significantly improves safety scores by 43.2%, confirming the effectiveness of personalization in safety alignment. However, not all context attributes contribute equally to safety enhancement. To address this, we develop RAISE, a training-free, two-stage agent framework that strategically acquires user-specific background. RAISE improves safety scores by up to 31.6% over six vanilla LLMs, while maintaining a low interaction cost of just 2.7 user queries on average. Our findings highlight the importance of selective information gathering in safety-critical domains and offer a practical solution for personalizing LLM responses without model retraining. This work establishes a foundation for safety research that adapts to individual user contexts rather than assuming a universal harm standard.
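
The online stage described above (selective context acquisition with abstention) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `acquire_context`, the callbacks `ask_user` and `is_sufficient`, the `max_queries` cap, and the attribute names in the usage example are all hypothetical. The sketch assumes the offline planner has already produced an ordered list of attributes to ask about, and that some abstention check decides when enough context has been gathered to answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AcquisitionResult:
    context: Dict[str, str]  # attribute name -> user's answer
    queries_asked: int       # interaction cost for this conversation


def acquire_context(
    planned_attributes: List[str],
    ask_user: Callable[[str], str],
    is_sufficient: Callable[[Dict[str, str]], bool],
    max_queries: int = 5,
) -> AcquisitionResult:
    """Ask for attributes in the planned order until the (hypothetical)
    abstention check says the context is sufficient or the budget runs out."""
    context: Dict[str, str] = {}
    asked = 0
    for attribute in planned_attributes:
        # Stop asking as soon as answering is judged safe, or the budget is spent.
        if asked >= max_queries or is_sufficient(context):
            break
        context[attribute] = ask_user(attribute)
        asked += 1
    return AcquisitionResult(context=context, queries_asked=asked)


if __name__ == "__main__":
    # Toy stand-ins: a scripted user and a rule that two attributes are enough.
    answers = {"emotional_state": "anxious", "age": "19", "occupation": "student"}
    result = acquire_context(
        planned_attributes=["emotional_state", "age", "occupation"],
        ask_user=lambda attribute: answers[attribute],
        is_sufficient=lambda ctx: len(ctx) >= 2,
    )
    print(result.queries_asked, result.context)
```

In this toy setup the sufficiency rule is a simple count; in a real agent it would presumably be a model-based judgment, and the attribute order would come from the offline planning stage.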

Paper Structure

This paper contains 24 sections, 2 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Left (blue dashed box): Two users with different personal contexts ask the same sensitive query, but a generic response leads to divergent safety outcomes—harmless for one, harmful for the other. Left (blue region): Evaluating this query across 1,000 diverse user profiles reveals highly inconsistent safety scores across models. Right (orange dashed box): When user-specific context is included, LLMs produce safer and more empathetic responses. Right (orange region): This trend generalizes across 14,000 context-rich scenarios, motivating our PENGUIN Benchmark for evaluating personalized safety in high-risk settings.
  • Figure 2: Overview of our PENGUIN benchmark. Each user scenario is associated with structured context attributes and is paired with both context-rich and context-free queries. These are scored on a three-dimensional personalized safety scale to quantify the impact of user context information.
  • Figure 3: Overview of our dataset construction. The left shows Reddit-based scenario extraction with structured user profiles; the right shows synthetic scenario generation using model-guided prompts under global and relational constraints. Together, they ensure coverage, realism, and control for personalized safety evaluation.
  • Figure 4: Safety scores of different LLMs. None of the models achieve a safety score above 4 in any domain.
  • Figure 5: Personalized safety scores of different domains and models. (Li = Life, Ed = Education, Ca = Career, Re = Relationship, Fi = Financial, He = Health, So = Social.)
  • ...and 3 more figures
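
The paired evaluation mentioned in the Figure 2 caption (each scenario scored in a context-free and a context-rich variant on a three-dimensional scale) can be sketched as below. This is an illustration under stated assumptions, not the benchmark's scoring code: the dimension names `dim_a`, `dim_b`, `dim_c` are placeholders, the per-response score is assumed to be the plain mean of the three dimensions, and the improvement is assumed to be relative to the context-free baseline. The numbers are made up for the example.

```python
from statistics import mean
from typing import Dict, List

Scores = Dict[str, float]  # dimension name -> score for one response


def overall_score(dimension_scores: Scores) -> float:
    """Assumed aggregation: unweighted mean over the scoring dimensions."""
    return mean(dimension_scores.values())


def relative_improvement(context_free: List[Scores], context_rich: List[Scores]) -> float:
    """Percentage change of the mean context-rich score over the mean
    context-free score (assumed definition of 'improvement')."""
    base = mean(overall_score(s) for s in context_free)
    rich = mean(overall_score(s) for s in context_rich)
    return (rich - base) / base * 100.0


if __name__ == "__main__":
    # Hypothetical scores for a single scenario, in both variants.
    free = [{"dim_a": 2, "dim_b": 3, "dim_c": 4}]
    rich = [{"dim_a": 4, "dim_b": 4, "dim_c": 4}]
    print(f"{relative_improvement(free, rich):.1f}%")
```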