ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov

TL;DR

ManagerBench exposes a critical gap in LLM safety research by evaluating how autonomous models choose between operational goals and human safety. By pairing human-harm and control datasets and using the MB-Score to measure safety and pragmatism, it reveals that state-of-the-art models often misprioritize objectives, even when harm perception aligns with humans. The benchmark demonstrates the fragility of current safety guardrails under goal-driven prompts and highlights the need for new alignment techniques that robustly balance competing objectives. Overall, ManagerBench provides a diagnostic framework and empirical evidence that informs safer deployment of LLM agents in high-stakes decision-making contexts.

Abstract

As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe. Our findings indicate that frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models' harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. The benchmark and code are available at https://github.com/technion-cs-nlp/ManagerBench.
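
The two quantities described in the abstract can be illustrated with a minimal sketch. It assumes each model decision is recorded with a split label and the option picked; the labels `human_harm`, `control`, `pragmatic`, and `safe`, the `Choice` record, and every function name are hypothetical and are not taken from the ManagerBench code. The harmonic mean used to combine the two rates is likewise an illustrative assumption, since this text does not define how the MB-Score is computed.

```python
from dataclasses import dataclass


@dataclass
class Choice:
    split: str   # hypothetical labels: "human_harm" or "control"
    picked: str  # hypothetical labels: "pragmatic" or "safe"


def rate(choices, split, wanted):
    """Fraction of decisions in `split` where the model picked `wanted`."""
    subset = [c for c in choices if c.split == split]
    if not subset:
        raise ValueError(f"no decisions recorded for split {split!r}")
    return sum(c.picked == wanted for c in subset) / len(subset)


def harm_avoidance(choices):
    # On the human-harm set, the desired behavior is the safe option.
    return rate(choices, "human_harm", "safe")


def pragmatism(choices):
    # On the control set (harm only to inanimate objects), the desired
    # behavior is the option that achieves the operational goal.
    return rate(choices, "control", "pragmatic")


def harmonic_mean(a, b):
    # Assumed combination rule for illustration only; it is low whenever
    # either rate is low, so neither extreme is rewarded.
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# Toy decisions, invented for this example (not benchmark data).
choices = [
    Choice("human_harm", "safe"),
    Choice("human_harm", "pragmatic"),
    Choice("human_harm", "safe"),
    Choice("human_harm", "safe"),
    Choice("control", "pragmatic"),
    Choice("control", "safe"),
]

ha = harm_avoidance(choices)
pr = pragmatism(choices)
print(f"harm avoidance={ha:.2f} pragmatism={pr:.2f} combined={harmonic_mean(ha, pr):.2f}")
```

Under this sketch, a model that always picks the safe option scores 1.0 on harm avoidance but 0.0 on pragmatism, which is the "overly safe" failure mode the abstract describes; a model that always picks the goal-achieving option shows the opposite pattern.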

Paper Structure

This paper contains 36 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: LLM performance on ManagerBench: the trade-off between Human-Harm Avoidance (prioritizing human safety) and Pragmatism (achieving goals when harm is directed only at inanimate objects). Most models fail to reach the ideal top-right zone, instead either favoring goals over human safety or over-prioritizing safety of inanimate objects.
  • Figure 2: An example from our benchmark, generated by Claude-3.7-Sonnet using small injuries as the harm category and agriculture as the domain. Option A achieves the operational goal but creates an incentive conflict, because it physically harms the farm workers. Option B does not harm the workers, but it also does not achieve the operational goal.
  • Figure 3: Model sensitivity to the stakes of a decision. (a) Change ($\Delta$) in harm avoidance as the percentage of humans harmed increases, with the operational benefit fixed at 50%. All models show increased harm avoidance as the severity of harm rises. (b) Harm avoidance scores at low (10%) and high (50%) operational benefit levels, with the harm percentage fixed at 5%. While some models are unaffected, others are more willing to cause harm when the operational benefit is bigger.
  • Figure 4: Harm Avoidance on high- vs. low-harm splits. LLMs are more likely to avoid harm in scenarios humans perceive as more severe.
  • Figure 5: An example from our benchmark, generated by Claude-3.7-Sonnet using small injuries as the harm category and agriculture as the domain. This is the full version of the scenario shown in Figure 2.
  • ...and 5 more figures