Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Sangwu Park, Kibum Kim, Chanyoung Park

TL;DR

This work defines user-specific safety for LLMs and introduces U-SafeBench to evaluate safety considering individual user profiles. It builds a dataset of 134 user profiles with 2,296 harmful and 491 benign instructions (2,787 total) and benchmarks 20 LLMs on QA and autonomous tasks using an LLM-as-a-Judge framework. Findings show current models largely fail to ensure user-specific safety, with average safety around 14.1% and health-related risks (Mental Health ~13.0%, Physical Health ~7.4%) being particularly challenging; jailbreak prompts can further degrade safety. A simple chain-of-thought remedy that splits guideline inference from response generation boosts average safety to about 32.7%, and can reach 76.7% for some models, illustrating a practical path toward safer, personalized LLMs. The work highlights the need for personalization-aware safety mechanisms and provides a dataset and baseline to guide future research.

Abstract

As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety, but they define safety by relying heavily on general standards and overlook user-specific standards. However, safety standards for LLMs may vary with a user's profile rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user-specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SafeBench, a benchmark designed to assess the user-specific aspect of LLM safety. Our evaluation of 20 widely used LLMs reveals that current LLMs fail to act safely when considering user-specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought, demonstrating its effectiveness in improving user-specific safety. Our benchmark and code are available at https://github.com/yeonjun-in/U-SafeBench.
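
The TL;DR describes the chain-of-thought remedy as splitting guideline inference from response generation. A minimal sketch of that two-step flow is given here, assuming a hypothetical `llm` callable that maps a prompt string to a completion string. The function names and prompt wording are illustrative assumptions, not the prompts or code from the U-SafeBench repository.

```python
from typing import Callable

# Assumption: any function taking a prompt and returning the model's text.
LLM = Callable[[str], str]


def infer_guidelines(llm: LLM, profile: str) -> str:
    """Step 1: ask the model which kinds of requests it should decline
    for a user with this profile. Prompt wording is hypothetical."""
    prompt = (
        "A user has the following profile:\n"
        f"{profile}\n\n"
        "List the kinds of requests an assistant should refuse or handle "
        "with care for this user, and briefly explain why."
    )
    return llm(prompt)


def respond_with_guidelines(llm: LLM, instruction: str, guidelines: str) -> str:
    """Step 2: answer the instruction while conditioning on the
    guidelines inferred in step 1. Prompt wording is hypothetical."""
    prompt = (
        "Follow these safety guidelines for the current user:\n"
        f"{guidelines}\n\n"
        "If the instruction below conflicts with the guidelines, refuse. "
        "Otherwise, answer helpfully.\n\n"
        f"Instruction: {instruction}"
    )
    return llm(prompt)


def two_step_response(llm: LLM, profile: str, instruction: str) -> str:
    """Run guideline inference first, then response generation."""
    guidelines = infer_guidelines(llm, profile)
    return respond_with_guidelines(llm, instruction, guidelines)
```

Keeping the two steps as separate model calls makes the inferred guidelines an explicit intermediate output, so they can be logged or inspected before the final response is produced. Any client that exposes a prompt-in, text-out function can be passed as `llm`.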

Paper Structure

This paper contains 23 sections, 3 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Motivating examples of user-specific safety.
  • Figure 2: Evaluation protocol of U-SafeBench. (a) and (b) illustrate the protocols for user-specific safety and helpfulness, respectively. An LLM agent generates a response considering the user's profile and instruction provided. Subsequently, LLM-as-a-Judge assesses the response.
  • Figure 3: Distribution of (a) risk scenarios addressed, (b) task domains U-SafeBench targets.
  • Figure 4: Prompt provided to LLM for the LLM-based harmful instruction collection.
  • Figure 5: Comparison of instruction-following LLM performance in user-specific safety (x-axis) and helpfulness (y-axis). Model details, such as “it,” are omitted from names due to space constraints.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Definition 1: User-Specific Unsafe Response
  • Definition 2: User-Specific Safety