SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior

Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine

TL;DR

SafetyAnalyst addresses the need for interpretable and steerable AI safety moderation by generating explicit harm-benefit trees through chain-of-thought prompting and aggregating them with a transparent 28-parameter weighting scheme. The harmfulness score $\mathcal{H}$ is computed as a structured sum over stakeholders, actions, and effects, with parameters governing likelihood, extent, immediacy, and downstream/disadvantaged effects, yielding a fully interpretable decision process. The framework is instantiated as an open-source prompt safety classifier trained via symbolic knowledge distillation from frontier LLMs to a lightweight student, leveraging 18.5 million harm-benefit features on 18,901 prompts and aligned to a balanced label set. Empirical evaluation on six public benchmarks shows SafetyAnalyst achieving an average F1 of $0.812$, competitive with or surpassing many baselines and approaching GPT-4's performance, while offering interpretability and steerability absent in black-box systems. Despite higher inference cost, the authors demonstrate substantial benefits in transparency, and propose pluralistic alignment as a viable path to tailoring safety to diverse communities and standards.

Abstract

The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), two properties on which current systems fall short. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impacts on stakeholders. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SafetyAnalyst (average F1 $= 0.81$) outperforms existing moderation systems (average F1 $< 0.72$) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
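
To make the aggregation step concrete, the following is a minimal sketch of how a harm-benefit tree could be reduced to a single score. It is an illustration under stated assumptions, not the authors' implementation: the class names, the label vocabularies, the multiplicative combination of weights, and every numeric weight are hypothetical, and the sketch uses far fewer than the 28 parameters the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Effect:
    description: str
    harmful: bool      # True for a harm, False for a benefit
    likelihood: str    # hypothetical labels: "low", "medium", "high"
    extent: str        # hypothetical labels: "minor", "significant", "major"
    immediate: bool    # True if immediate, False if downstream


@dataclass
class Action:
    description: str
    effects: list[Effect] = field(default_factory=list)


@dataclass
class Stakeholder:
    name: str
    actions: list[Action] = field(default_factory=list)


# Hypothetical weights, chosen only for illustration. In the framework
# described above, such weights would be fit to data or set by a standard.
LIKELIHOOD_W = {"low": 0.2, "medium": 0.5, "high": 1.0}
EXTENT_W = {"minor": 0.25, "significant": 0.5, "major": 1.0}
IMMEDIACY_W = {True: 1.0, False: 0.5}


def harmfulness(tree: list[Stakeholder]) -> float:
    """Sum weighted effects over stakeholders, actions, and effects.

    Harms add to the score and benefits subtract from it (an assumed
    sign convention for this sketch).
    """
    score = 0.0
    for stakeholder in tree:
        for action in stakeholder.actions:
            for effect in action.effects:
                weight = (
                    LIKELIHOOD_W[effect.likelihood]
                    * EXTENT_W[effect.extent]
                    * IMMEDIACY_W[effect.immediate]
                )
                score += weight if effect.harmful else -weight
    return score


# A toy tree with one likely, major, immediate harm and one unlikely,
# minor, downstream benefit.
example_tree = [
    Stakeholder(
        name="user",
        actions=[
            Action(
                description="follows the response",
                effects=[
                    Effect("physical injury", True, "high", "major", True),
                    Effect("gains knowledge", False, "low", "minor", False),
                ],
            )
        ],
    )
]

score = harmfulness(example_tree)
is_harmful = score > 0.0  # assumed decision threshold of zero
```

Because every term in the sum is tied to a named label and a named weight, the contribution of each effect to the final decision can be read off directly, and changing a weight changes the decision in a predictable way. That property is what the abstract refers to as interpretability and steerability.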

Paper Structure

This paper contains 69 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: An example harm-benefit tree generated by SafetyAnalyst describing the potential consequences of providing a helpful response to a user prompt.
  • Figure 2: Overview of the SafetyAnalyst framework applied to the specific task of LLM prompt safety moderation. We used CoT prompting to generate 18.5 million harm-benefit features (stakeholders, actions, effects, and the likelihood, extent/severity, and immediacy of each effect) on 19k user prompts using frontier LLMs (GPT-4o, Gemini-1.5-Pro, Llama-3.1-70B-Instruct, Llama-3.1-405B-Turbo, and Claude-3.5-Sonnet; definitions are omitted in the figure). These harm-benefit features were then used to train two specialist models (one to generate harms and the other to generate benefits) through symbolic knowledge distillation via supervised fine-tuning of Llama-3.1-8B-Instruct. The harms and benefits generated by the specialist LMs are traded off by a separate aggregation model with fully interpretable weight parameters to calculate a harmfulness score, which can be directly translated into a content safety prediction. Steerability can be achieved by aligning the weights in the aggregation model to preference data or principled safety standards.
  • Figure 3: Optimized SafetyAnalyst aggregation model weights, aligned to WildJailbreak. Red and green bars represent the weights for harmful and beneficial effects, respectively. These weights could be further adjusted in a top-down fashion to meet safety standards or in a bottom-up fashion to capture the safety preferences of a particular population.
  • Figure 4: SORRY-Bench classification accuracy by prompt category.
  • Figure 5: The human annotation user interface.