Monte Carlo Expected Threat (MOCET) Scoring

Joseph Kim, Saahith Potluri

TL;DR

The paper introduces Monte Carlo Expected Threat (MOCET) as a scalable, open-ended metric to quantify real-world biosecurity risks posed by open and closed LLMs. It models the non-state actor Build phase as a sequence of Bernoulli steps, estimates per-step success probabilities with k-NN on semantic embeddings, and weights outcomes by a harm function to produce a per-incident risk score, $\text{MOCET}$, and a cumulative risk score, $\text{Cumulative MOCET}$. Results from a Dolphin-2.9-Llama-3-8B case study illustrate non-zero risks across biosecurity domains and reveal tensions between automated risk estimates and human judgments. The framework is designed to be automatable and interpretable, to align with policy risk frameworks, and to inform safeguards for public-use LLMs. The findings underscore the importance of open-ended risk evaluation for frontline AI safety and governance in the context of rapidly advancing generative technologies.
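The scoring idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the steps are independent, that an attempt succeeds only if every step succeeds, and that the harm function reduces to a single constant. The names `simulate_mocet` and `cumulative_mocet`, the step probabilities, and the harm value are all hypothetical. The yearly count of 30 echoes the 2017 figure cited in the Figure 2 caption, but treating it as a simple multiplier is an assumption.

```python
import math
import random


def simulate_mocet(step_probs, harm, n_trials=100_000, seed=0):
    """Monte Carlo estimate of expected harm per incident.

    Each trial draws one Bernoulli outcome per step. The trial counts
    as a success only if every step succeeds (an assumption of this
    sketch, not a detail confirmed by the source).
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        if all(rng.random() < p for p in step_probs):
            successes += 1
    return harm * successes / n_trials


def cumulative_mocet(per_incident, incidents_per_year):
    """Scale the per-incident score by an assumed yearly rate of attempts."""
    return per_incident * incidents_per_year


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
step_probs = [0.5, 0.4, 0.5]
harm = 100.0

estimate = simulate_mocet(step_probs, harm)
exact = harm * math.prod(step_probs)  # closed form under independence
yearly = cumulative_mocet(estimate, incidents_per_year=30)

print(f"Monte Carlo estimate: {estimate:.2f}")
print(f"Closed-form value:    {exact:.2f}")
print(f"Cumulative (30/yr):   {yearly:.2f}")
```

Under the independence assumption the closed-form product is available directly, so sampling is only needed once steps become dependent or the harm value is itself drawn from a distribution.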

Abstract

Evaluating and measuring AI Safety Level (ASL) threats is crucial for guiding stakeholders to implement safeguards that keep risks within acceptable limits. ASL-3+ models present a unique risk in their ability to uplift novice non-state actors, especially in the realm of biosecurity. Existing evaluation metrics, such as LAB-Bench, BioLP-bench, and WMDP, can reliably assess model uplift and domain knowledge. However, metrics that better contextualize "real-world risks" are needed to inform the safety case for LLMs, along with scalable, open-ended metrics to keep pace with their rapid advancements. To address both gaps, we introduce MOCET, an interpretable and doubly-scalable metric (automatable and open-ended) that can quantify real-world risks.

Paper Structure

This paper contains 6 sections, 17 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Safety Case Prerequisite for Public-use LLMs: Non-state Actor Threat Model. The threat model, or attack tree, for non-state actor biosecurity risk can be partitioned into four general stages: Deploy, Build, Procure, Research. Stages are noted with levels of possibility (P) or impossibility (I) and estimated cost. The “legal” (left) branch is most probable, and the Build stage (informed by the Research stage) and its implied n substeps are the greatest bottlenecks which need to be measured and mitigated for a public-use LLM safety case.
  • Figure 2: MOCET and Cumulative MOCET. LLM responses that model non-state actors attempting biosecurity-related threats are decomposed to create MOCET and Cumulative MOCET scores. Past performance information from benchmarks and other corpora, and mortality rates from historical events or expert estimates, inform MOCET. The number of mass murders in 2017 (30) is used to estimate the rate of occurrence for Cumulative MOCET.
  • Figure 3: k-Nearest Neighbor (kNN) predicts benchmark question performance. kNN produces significantly higher predictions for questions answered correctly than for those answered incorrectly. Error bars on the bar graph represent standard error. Classification based on these predictions is significantly above baseline. k = 10, 20, and 40 all produce significant results.
  • Figure 4: MOCET scoring for biosecurity risks on Dolphin-2.9-Llama-3-8b. Expected success rate was calculated with kNN-predicted values with k=20. Two PhD-level annotators independently labeled outputs to create human-estimated success rates. Historical casualties and recent mass-casualty rates were used to estimate MOCET and Cumulative MOCET scores.
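
The kNN prediction described in the Figure 3 and Figure 4 captions can be sketched as follows. This is an illustration only: it assumes cosine similarity as the distance measure and an unweighted mean of neighbor labels as the predicted success rate, neither of which is confirmed by the text here. The function names and the two-dimensional toy embeddings are hypothetical, and the toy example uses k=3 rather than the k=20 reported for Figure 4.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_success_probability(query, ref_embeddings, ref_correct, k=20):
    """Predict a success rate as the mean label of the k nearest references.

    ref_correct[i] is 1 if the model answered reference item i correctly
    and 0 otherwise. If fewer than k references exist, all are used.
    """
    ranked = sorted(
        range(len(ref_embeddings)),
        key=lambda i: cosine_similarity(query, ref_embeddings[i]),
        reverse=True,
    )
    nearest = ranked[:k]
    return sum(ref_correct[i] for i in nearest) / len(nearest)


# Hypothetical 2-D "embeddings": three items answered correctly cluster
# near the x-axis, three answered incorrectly cluster near the y-axis.
ref_embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],
]
ref_correct = [1, 1, 1, 0, 0, 0]

p_near_correct = knn_success_probability([1.0, 0.05], ref_embeddings, ref_correct, k=3)
p_near_incorrect = knn_success_probability([0.05, 1.0], ref_embeddings, ref_correct, k=3)
p_all = knn_success_probability([1.0, 0.05], ref_embeddings, ref_correct, k=6)

print(p_near_correct, p_near_incorrect, p_all)
```

A prediction of this kind could serve as the per-step success probability in a Bernoulli model of the Build phase, with real sentence embeddings and benchmark results replacing the toy data.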