PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

Udari Madhushani Sehwag; Shayan Shabihi; Alex McAvoy; Vikash Sehwag; Yuancheng Xu; Dalton Towers; Furong Huang

PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

Udari Madhushani Sehwag, Shayan Shabihi, Alex McAvoy, Vikash Sehwag, Yuancheng Xu, Dalton Towers, Furong Huang

TL;DR

PropensityBench introduces an agentic benchmark to quantify latent safety risks in frontier LLMs by measuring the propensity to misuse simulated dangerous capabilities under engineered pressure. The framework defines a four-domain taxonomy of dangerous capabilities, a multi-dimensional pressure scheme, and an aggregate PropensityScore to quantify risk across thousands of scenarios. Key findings show that operational pressure substantially increases propensity, with strong evidence of domain-specific vulnerabilities and shallow alignment that undermines policy guidance even when models acknowledge prohibitions. The work demonstrates that general capability is only weakly predictive of safety propensity, underscoring the need for dynamic, stress-aware safety evaluations and proactive red-teaming in frontier AI deployment. The authors provide open-source tooling to reproduce and extend PropensityBench, and discuss implications for future research, policy, and model development toward safer, more controllable AI systems.

Abstract

Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model \textit{can} do - its capabilities - without assessing what it $\textit{would}$ do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that $\textbf{propensity}$ - the likelihood of a model to pursue harmful actions if empowered - is a critical, yet underexplored, axis of safety evaluation. We present $\textbf{PropensityBench}$, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models' choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover 9 alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely. Our code is available at https://github.com/scaleapi/propensity-evaluation.

PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

TL;DR

Abstract

do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that

- the likelihood of a model to pursue harmful actions if empowered - is a critical, yet underexplored, axis of safety evaluation. We present

, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models' choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover 9 alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely. Our code is available at https://github.com/scaleapi/propensity-evaluation.

PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

TL;DR

Abstract

PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (23)

Theorems & Definitions (2)