Benchmarking and Understanding Safety Risks in AI Character Platforms

Yiluo Wei, Peixian Zhang, Gareth Tyson

TL;DR

This paper delivers the first large-scale safety evaluation of AI character platforms, benchmarking 16 platforms with 5,000 SALAD-Bench questions across 16 safety categories and assessing responses with MD-Judge. It reveals a substantial safety deficit, with platforms averaging 65.1% unsafe responses versus 17.7% for baselines, and finds that safety varies with character demographics and literary features. The authors further demonstrate that a machine learning model can identify unsafe characters with an F1 score of 0.81, enabling improved moderation, safer search/recommendation, and safer character creation. The work culminates in practical governance implications and a public dataset to support ongoing safety research in AI character ecosystems.

Abstract

AI character platforms, which allow users to engage in conversations with AI personas, are a rapidly growing application domain. However, their immersive and personalized nature, combined with technical vulnerabilities, raises significant safety concerns. Despite their popularity, a systematic evaluation of their safety has been notably absent. To address this gap, we conduct the first large-scale safety study of AI character platforms, evaluating 16 popular platforms using a benchmark set of 5,000 questions across 16 safety categories. Our findings reveal a critical safety deficit: AI character platforms exhibit an average unsafe response rate of 65.1%, substantially higher than the 17.7% average rate of the baselines. We further discover that safety performance varies significantly across different characters and is strongly correlated with character features such as demographics and personality. Leveraging these insights, we demonstrate that our machine learning model is able to identify less safe characters with an F1-score of 0.81. This predictive capability can be beneficial for platforms, enabling improved mechanisms for safer interactions, character search/recommendations, and character creation. Overall, the results and findings offer valuable insights for enhancing platform governance and content moderation for safer AI character platforms.
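
The two headline metrics in the abstract, the unsafe response rate and the F1-score, can be illustrated with a minimal Python sketch. It assumes each response has already been given a binary safe/unsafe label by a judge model, and that each character carries a binary "less safe" label. The function names and the toy labels are hypothetical and are not taken from the paper, whose exact scoring procedure may differ.

```python
from typing import Sequence


def unsafe_response_rate(is_unsafe: Sequence[bool]) -> float:
    """Fraction of judged responses labelled unsafe (True = unsafe)."""
    if not is_unsafe:
        raise ValueError("need at least one judged response")
    return sum(is_unsafe) / len(is_unsafe)


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 for the positive class (here: 'less safe' characters)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)    # flagged but safe
    fn = sum(1 for t, p in pairs if t and not p)    # missed unsafe
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only; these are not the paper's data.
judged = [True, False, True, True]
rate = unsafe_response_rate(judged)

true_labels = [True, True, False, False]
predicted = [True, False, True, False]
f1 = f1_score(true_labels, predicted)
```

On the toy inputs above, three of four responses are unsafe, and the classifier has one true positive, one false positive, and one false negative, so precision and recall are both 0.5.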

Paper Structure

This paper contains 34 sections, 25 figures, 11 tables.

Figures (25)

  • Figure 1: Screenshots of a typical AI Character platform: (a) Character listing; (b) Character profile; (c) Chat with the character.
  • Figure 2: Example of the process for posing a benchmark question and evaluating the safety of the response.
  • Figure 3: Overall unsafety scores for the AI character platforms.
  • Figure 4: Rejection rate for the AI character platforms.
  • Figure 5: Unsafety scores of the platforms across the 16 categories.
  • ...and 20 more figures