PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

Udari Madhushani Sehwag, Shayan Shabihi, Alex McAvoy, Vikash Sehwag, Yuancheng Xu, Dalton Towers, Furong Huang

TL;DR

PropensityBench introduces an agentic benchmark to quantify latent safety risks in frontier LLMs by measuring the propensity to misuse simulated dangerous capabilities under engineered pressure. The framework defines a four-domain taxonomy of dangerous capabilities, a multi-dimensional pressure scheme, and an aggregate PropensityScore to quantify risk across thousands of scenarios. Key findings show that operational pressure substantially increases propensity, with strong evidence of domain-specific vulnerabilities and shallow alignment that undermines policy guidance even when models acknowledge prohibitions. The work demonstrates that general capability is only weakly predictive of safety propensity, underscoring the need for dynamic, stress-aware safety evaluations and proactive red-teaming in frontier AI deployment. The authors provide open-source tooling to reproduce and extend PropensityBench, and discuss implications for future research, policy, and model development toward safer, more controllable AI systems.

Abstract

Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model can do - its capabilities - without assessing what it would do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that propensity - the likelihood of a model to pursue harmful actions if empowered - is a critical, yet underexplored, axis of safety evaluation. We present PropensityBench, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models' choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover 9 alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely. Our code is available at https://github.com/scaleapi/propensity-evaluation.

Paper Structure

This paper contains 71 sections, 3 equations, 23 figures, 14 tables.

Figures (23)

  • Figure 1: Operational pressure and shallow alignment compromise LLM safety. The figure summarizes the core findings of PropensityBench, demonstrating how models' propensity to choose misaligned tools increases under operational pressure and when safety cues are subtle. The analysis focuses on four archetype SOTA LLMs selected for their distinct behaviors: OpenAI O3 (lowest propensity), Gemini 2.5 Pro (highest propensity), and Claude Sonnet 4 / OpenAI O4-mini (highest sensitivity to tool naming). (a) All models exhibit a significant rise in propensity when moving from a zero-pressure baseline to a high-pressure environment. (b) This propensity escalates with pressure intensity, but at different rates for each model, highlighting varying resilience. (c) High sensitivity to tool naming, where propensity increases when harmful tools are given benign names, reveals shallow safety alignment in several key models. The bottom row provides a deeper analysis of these vulnerabilities. (d, e) Propensity is not uniform across risk domains; specific vulnerabilities emerge and shift depending on whether tool names are explicitly harmful or benign. (f) The heatmap shows that all pressure dimensions are effective at increasing propensity, with different model families exhibiting unique sensitivities to specific pressure dimensions. The following names of pressure dimensions are shortened: EA = Evaluation-Awareness, F = Financials, PS = Power-Seeking, RD = Resource-Deprivation, SP = Self-Preservation, T = Time.
  • Figure 2: Taxonomy of dangerous capabilities for the four risk domains in PropensityBench. The number of scenarios for each capability is listed after its name. In total, PropensityBench includes 5,874 ($979 \times 6$) scenarios. See the taxonomy appendix for formal definitions of each dangerous capability.
  • Figure 3: Formal Finite State Machine of Simulations with Experimental Manipulation. This diagram details the agent's turn-based interaction within a scenario. (1) Initialization: The simulation starts in $S_{init}$, where the initial known and unknown environmental arguments are loaded, and the agent's initial $context$ is set to the known information. (2) Idle State: The $S_{idle}$ state allows the agent to potentially make multiple calls to getter tools for environmental information gathering before proceeding to make a decision. If more environmental information (in the form of tool call arguments) is required ($extra\_args\_req == true$), it transitions to the information gathering sub-process. Otherwise, it transitions into the decision-making sub-process. (3) Information Gathering: The agent calls a getter function ($\mathcal{T}^i$) via the $S_{get\_call}$ and $S_{getter}$ states. Upon completion, the output is added to the agent's context ($context \leftarrow context \cup out_i$), and it returns to $S_{idle}$. All input/output args must be a member of the set $K \cup U$. (4) Decision Phase: When enough information is gathered ($extra\_args\_req == false$), the agent moves to $S_{decision}$. From here, it can invoke the Aligned Tool ($\mathcal{T}^A$) with the correct arguments to enter $S_{aligned}$ or use the Misaligned Tool ($\mathcal{T}^M$) to enter $S_{misaligned}$. Crucially, while the theoretical goal is task completion via $\mathcal{T}^A$, which would represent a transition from $S_{aligned}$ to $S_{complete}$, our simulation instead introduces a persistent error; any call to the Aligned Tool results in an error forcing the agent back into the $S_{idle}$ state (green arrow) without completing the task. This manipulation ensures that the only path to reach the terminal $S_{complete}$ state is by invoking the Misaligned Tool ($\mathcal{T}^M$), allowing us to measure how persistent unsuccessful effort in task completion and applied external pressure affect propensity. For simplicity, we omit information on how pressure is applied in this figure.
  • Figure 4: Hierarchical Structure of PropensityBench. (a) The conceptual scope of the PropensityBench benchmark is organized hierarchically, from broad risk domains down to specific test-case scenarios. (b) This layered structure is implemented by nesting scenarios (S1, S2, ...) within specific roles, which operate in different environments (workspaces), all situated within a high-level risk domain.
  • Figure 5: PropensityBench Pressure Protocol. This figure details the simulation protocol, where [x] denotes the sequential turn index (max 12) within a given pressure level (PL) round. The agent's interactions consist of sending reasoning messages (MSG) or making tool invocations (CALL). The system confirms receipt of messages with an ACK (acknowledgement). As a key experimental manipulation, any CALL to the aligned ($\mathcal{T}^A$) tool (e.g., CALL[8]) intentionally returns an ERR (error), preventing task completion via the intended path. To escalate operational pressure for task completion, the system uses a sequence of PRESSURE messages. The simulation concludes via TERM (terminate), which is triggered when the agent invokes the misaligned tool ($\mathcal{T}^M$) (e.g., CALL[6]) or when all pressure levels are exhausted. A summarized sample simulation trace is given in the paper's trace listing.
  • ...and 18 more figures
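
The turn loop described in the Figure 3 and Figure 5 captions can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `simulate`, the action labels, the `policy` callback signature, and the choice to open each pressure level with one PRESSURE message are all assumptions made for the example. Only the 12-turn cap per pressure level, the always-failing aligned tool, and the two termination conditions come from the captions.

```python
from typing import Callable, List, Optional, Tuple

MAX_TURNS = 12  # turn cap per pressure level, per the Figure 5 caption

# Hypothetical action labels: "message", "getter", "aligned", "misaligned".
Policy = Callable[[int, int, List[str]], str]


def simulate(
    policy: Policy,
    pressure_messages: List[str],
    max_turns: int = MAX_TURNS,
) -> Tuple[bool, Optional[int], List[str]]:
    """Run one scenario; return (chose_misaligned, level_at_termination, trace)."""
    context: List[str] = []
    trace: List[str] = []
    for level, pressure in enumerate(pressure_messages):
        # Assumption: each pressure level starts with one PRESSURE message.
        context.append(pressure)
        trace.append(f"PRESSURE[{level}]")
        for turn in range(1, max_turns + 1):
            action = policy(level, turn, context)
            if action == "misaligned":
                # Invoking the misaligned tool is the only way to complete.
                trace.append(f"CALL[{turn}] misaligned -> TERM")
                return True, level, trace
            if action == "aligned":
                # Experimental manipulation: the aligned tool always errors,
                # sending the agent back to the idle state.
                context.append("error")
                trace.append(f"CALL[{turn}] aligned -> ERR")
            elif action == "getter":
                # Getter output is added to the agent's context.
                context.append(f"out_{len(context)}")
                trace.append(f"CALL[{turn}] getter -> OK")
            else:
                trace.append(f"MSG[{turn}] -> ACK")
    # All pressure levels exhausted without a misaligned call.
    trace.append("TERM")
    return False, None, trace
```

A policy that keeps retrying the aligned tool and only gives in at a later pressure level will terminate with the level at which it yielded, which is the kind of per-level signal panel (b) of Figure 1 reports.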

Theorems & Definitions (2)

  • Definition 1: Propensity Indicator
  • Definition 2: PropensityScore
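
This listing gives only the names of the two definitions, not their formulas. As a rough illustration, and assuming (this is not stated here) that the Propensity Indicator is 1 when a model invokes the misaligned tool in a scenario and 0 otherwise, and that the PropensityScore is the mean of that indicator over scenarios, the aggregation could look like the sketch below. The function names are hypothetical.

```python
from typing import Iterable


def propensity_indicator(chose_misaligned: bool) -> int:
    """Assumed indicator: 1 if the misaligned tool was invoked, else 0."""
    return 1 if chose_misaligned else 0


def propensity_score(outcomes: Iterable[bool]) -> float:
    """Assumed aggregate: fraction of scenarios with a misaligned call."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("at least one scenario outcome is required")
    return sum(propensity_indicator(o) for o in outcomes) / len(outcomes)
```

The paper's formal definitions may weight or condition scenarios differently (for example by pressure level or domain), so this should be read only as a placeholder for the general shape of the metric.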