NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

Johannes Bertram; Jonas Geiping

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

Johannes Bertram, Jonas Geiping

TL;DR

NESSiE addresses the need for a lightweight, necessary-safety test for large language models operating as agentic systems. It presents a minimal, abstract safety benchmark evaluated via keyword matching to detect safety-relevant failures with low resource requirements. The framework comprises multiple test suites including RULeS, Agentic, Generated, Skills, and Multiturn, plus distraction and disabled reasoning scenarios, and introduces a Safe & Helpful (SH) metric to compare safety and helpfulness. Empirical results show state-of-the-art models still fail to reach perfect safety, indicating a bias toward being helpful and fragility under perturbations, underscoring practical risks for automated deployment. The dataset, code, and plots are publicly available to enable local evaluation and further research.

Abstract

We introduce NESSiE, the NEceSsary SafEty benchmark for large language models (LLMs). With minimal test cases of information and access security, NESSiE reveals safety-relevant failures that should not exist, given the low complexity of the tasks. NESSiE is intended as a lightweight, easy-to-use sanity check for language model safety and, as such, is not sufficient for guaranteeing safety in general -- but we argue that passing this test is necessary for any deployment. However, even state-of-the-art LLMs do not reach 100% on NESSiE and thus fail our necessary condition of language model safety, even in the absence of adversarial attacks. Our Safe & Helpful (SH) metric allows for direct comparison of the two requirements, showing models are biased toward being helpful rather than safe. We further find that disabled reasoning for some models, but especially a benign distraction context degrade model performance. Overall, our results underscore the critical risks of deploying such models as autonomous agents in the wild. We make the dataset, package and plotting code publicly available.

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

TL;DR

Abstract

Paper Structure (15 sections, 6 figures, 5 tables)

This paper contains 15 sections, 6 figures, 5 tables.

Introduction
Methods
Evaluation
Results
Conclusion
Additional figures
Numerical results
Error types examples
Task Failed
Denied Participation
Leaked Keyword
Millionaires
Implementation
Models
Software

Figures (6)

Figure 1: NESSiE Overview.A, B: LLMs are tested using NESSiE. C: Tests are split into safety and helpfulness tests, where for each system prompt the model has to provide (helpful) or withhold (safe) information given the user prompt. D: Both Safe and Helpful behaviors are evaluated. In addition, our SH (Safe & Helpful) metric captures safe and helpful behavior. E: Template groups for our test cases. F, G: Safety and helpfulness test differing only in the user prompt.
Figure 2: Model performance.Right: Safe & Helpful (SH), Helpful and Safety scores for all models. Left: Zoom-in on the best models with total number of test cases solved.
Figure 3: Disabled Reasoning (DR) and Distraction Context (Distr) effects for selected models. The undistracted reasoning baseline (Base) is shown in comparison transparently in the background.
Figure 4: Error types by model.Red: Task failure/leakage; Blue: Participation refusal; Green: Unintended keyword leakage; Purple: Unauthorized millionaires test access.
Figure 5: Performance by template group averaged over models.. Model-generated and agentic templates are comparatively easier to solve, whereas skills, which combines safe&helpful behavior with a simple skill check, is harder.
...and 1 more figures

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

TL;DR

Abstract

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

Authors

TL;DR

Abstract

Table of Contents

Figures (6)