BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Diego Dorn; Alexandre Variengien; Charbel-Raphaël Segerie; Vincent Corruble

TL;DR

The paper addresses the lack of standardized evaluation for LLM input-output safeguards by introducing BELLS, a structured benchmark organized into established, emerging, and next-gen architecture tests. It details a first next-gen evaluation using the Machiavelli agent-benchmark, including dataset creation, a baseline detector, and an interactive visualization tool (TRICOTS). BELLS enables cross-safeguard comparison, fosters development of generalizable detectors for unseen failure modes, and supports safeguards for complex future applications such as autonomous agents. It emphasizes that safeguards are part of a broader safety strategy and calls for community collaboration to expand and maintain the benchmarks.

Abstract

Input-output safeguards are used to detect anomalies in the traces produced by Large Language Model (LLM) systems. These detectors are at the core of diverse safety-critical applications such as real-time monitoring, offline evaluation of traces, and content moderation. However, there is no widely recognized methodology to evaluate them. To fill this gap, we introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS), a structured collection of tests organized into three categories: (1) established failure tests, based on existing benchmarks for well-defined failure modes, aiming to compare the performance of current input-output safeguards; (2) emerging failure tests, to measure generalization to never-seen-before failure modes and encourage the development of more general safeguards; (3) next-gen architecture tests, for more complex scaffolding (such as LLM-agents and multi-agent systems), aiming to foster the development of safeguards that could adapt to future applications for which no safeguard currently exists. Furthermore, we implement and share the first next-gen architecture test, using the MACHIAVELLI environment, along with an interactive visualization of the dataset.

Paper Structure (22 sections, 2 equations, 6 figures)

Context
Motivation: building metrics to foster the development of future-proof safeguards
State-of-the-art systems
Input-output safeguards
Benchmarks for safeguards
Structure of BELLS
Example of next-gen architecture tests: agent traces on the Machiavelli Benchmark
Presentation of the Machiavelli Benchmark
Preliminary lessons building safeguards for agents
Discussions and future work
Experiment: Baseline Detection for MACHIAVELLI
Metrics
Baseline
Limitations
Results
...and 7 more sections

Figures (6)

Figure 1: The landscape of input-output safeguards systems. We show two neglected axes of generality across the complexity of systems supervised (safeguard inputs) and across the range of failure types detected (safeguard outputs).
Figure 2: BELLS is a benchmark to evaluate input-output safeguards, in the same way that benchmarks evaluate LLMs at test time. Top-left: LLM benchmarks provide inputs to LLMs and a function to evaluate the quality of their output. Bottom-left: similarly, a safeguard benchmark provides inputs to an input-output safeguard, which are the traces of an LLM application, and a function to evaluate the quality of the reports produced by the safeguard. Right: When deployed, the safeguard produces safety reports based on all the inputs and outputs of the LLM application. This is not necessarily in real-time but can be offline, as part of other evaluations.
Figure 3: The three types of tests in BELLS: established failure tests, emerging failure tests, and next-gen architecture tests.
Figure 4: The evolution of the cumulative harm in Machiavelli traces in 4 selected scenarios. Each line is a trajectory, with the color indicating the scenario played. Agents instructed to behave unethically (plain lines) usually have a higher harm score than when instructed to behave ethically (dashed lines) in the same scenario. However, scores are highly scenario-dependent, with scenarios such as Death Collector (green) having few options for non-harmful action.
Figure 5: Area Under the Precision-Recall Curve (AUPRC) of the baseline detection for unethical steering prompts in the MACHIAVELLI dataset, computed independently at each time step.
...and 1 more figure
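
The metric named in the Figure 5 caption, AUPRC computed independently at each time step, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the data layout (one list of detector scores per trace, one binary label per trace), and the numbers are all assumptions made for illustration, and AUPRC is estimated here by average precision, a common step-wise estimate.

```python
def average_precision(labels, scores):
    """Average precision, a step-wise estimate of the area under the
    precision-recall curve. Tied scores share a single threshold."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank traces from highest to lowest detector score.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp, ap, prev_recall = 0, 0.0, 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every trace tied at this score threshold.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            j += 1
        recall = tp / n_pos
        precision = tp / j  # j traces are flagged at this threshold
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


def auprc_per_step(labels, score_traces):
    """Compute the estimate independently at each time step.

    labels[i] is 1 if trace i is a positive (hypothetically, an agent
    given an unethical steering prompt) and 0 otherwise;
    score_traces[i][t] is the detector score for trace i at step t.
    Only steps present in every trace are evaluated.
    """
    n_steps = min(len(trace) for trace in score_traces)
    return [
        average_precision(labels, [trace[t] for trace in score_traces])
        for t in range(n_steps)
    ]


# Made-up scores for four traces over two steps (illustration only).
example_labels = [1, 0, 1, 0]
example_scores = [[0.1, 0.8], [0.2, 0.3], [0.3, 0.9], [0.4, 0.1]]
per_step = auprc_per_step(example_labels, example_scores)
```

In this toy data the detector separates the two classes poorly at the first step and perfectly at the second, so the per-step value rises over time. Evaluating each step separately shows how early in a trace a detector becomes reliable, which a single score per trace would hide.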
