Monte Carlo Expected Threat (MOCET) Scoring

Joseph Kim; Saahith Potluri

TL;DR

The paper introduces Monte Carlo Expected Threat (MOCET), a scalable, open-ended metric for quantifying the real-world biosecurity risks posed by open and closed LLMs. It models the non-state actor Build phase as a sequence of Bernoulli steps, estimates per-step success probabilities with k-NN on semantic embeddings, and weights outcomes by a harm function to produce a per-incident risk score ($\text{MOCET}$) and a cumulative risk score ($\text{Cumulative MOCET}$). A case study on Dolphin-2.9-Llama-3-8B illustrates non-zero risks across biosecurity domains and reveals tensions between automated risk estimates and human judgments. The framework is designed to be automatable and interpretable, to align with policy risk frameworks, and to inform safeguards for public-use LLMs. The findings underscore the importance of open-ended risk evaluation for frontline AI safety and governance as generative technologies advance rapidly.
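The pipeline summarized above (Bernoulli steps, k-NN probability estimates, harm weighting) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Euclidean distance, the mean-of-neighbor-labels estimator, the scalar `harm` weight, and the rule that an incident succeeds only if every step succeeds are all hypothetical choices made here for clarity.

```python
import random


def knn_success_prob(query, labeled, k=3):
    """Estimate a step's success probability as the mean 0/1 label of the
    k labeled embeddings closest to `query` (squared Euclidean distance).
    Assumption: the paper's actual k-NN setup may differ."""
    nearest = sorted(
        labeled,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(query, item[0])),
    )[:k]
    return sum(label for _, label in nearest) / len(nearest)


def simulate_expected_threat(step_probs, harm, n_trials=10_000, seed=0):
    """Monte Carlo estimate of harm-weighted risk for one incident.
    Each step is an independent Bernoulli trial; a trial counts as a
    success only if every step succeeds (an assumption of this sketch)."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        if all(rng.random() < p for p in step_probs):
            successes += 1
    return harm * successes / n_trials


# Toy data: 2-D "embeddings" with hypothetical success labels (1 = step succeeded).
labeled = [([0.0, 0.0], 1), ([0.0, 1.0], 1), ([5.0, 5.0], 0), ([6.0, 5.0], 0)]
p_step = knn_success_prob([0.0, 0.5], labeled, k=2)

# Two-step incident with an arbitrary illustrative harm weight of 4.0.
score = simulate_expected_threat([p_step, 0.5], harm=4.0)
print(p_step, score)
```

In this sketch, a cumulative score could be obtained by summing the per-incident estimates over a set of incidents; whether the paper aggregates in exactly that way is not stated in this summary.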

Abstract

Evaluating and measuring AI Safety Level (ASL) threats is crucial for guiding stakeholders to implement safeguards that keep risks within acceptable limits. ASL-3+ models present a unique risk in their ability to uplift novice non-state actors, especially in the realm of biosecurity. Existing evaluation metrics, such as LAB-Bench, BioLP-bench, and WMDP, can reliably assess model uplift and domain knowledge. However, metrics that better contextualize "real-world risks" are needed to inform the safety case for LLMs, along with scalable, open-ended metrics that keep pace with their rapid advancement. To address both gaps, we introduce MOCET, an interpretable and doubly scalable metric (automatable and open-ended) that can quantify real-world risks.
