RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents

Sid Black, Asa Cooper Stickland, Jake Pencharz, Oliver Sourbut, Michael Schmatz, Jay Bailey, Ollie Matthews, Ben Millwood, Alex Remedios, Alan Cooney

TL;DR

RepliBench addresses the safety risk of autonomous replication by decomposing the capability into four core domains (obtaining resources, exfiltrating model weights, replicating onto compute, and persisting on that compute) and constructing 20 task families (86 tasks), which are evaluated on five frontier models. A capability-tree analysis guides the choice of evaluations, and a Recursive Replication task family tests end-to-end replication and persistence, alongside tasks such as covert exfiltration. Results show that frontier models are competent at many of the component subskills but fall short on end-to-end replication, persistence, and robust exfiltration under realistic security constraints, although performance is improving rapidly. The work provides early-warning evidence and identifies concrete bottlenecks (e.g., passing Know Your Customer (KYC) checks, robust successor deployment, and realistic security) whose removal could enable autonomous replication in future model generations or with human assistance.

Abstract

Uncontrollable autonomous replication of language model agents poses a critical safety risk. To better understand this risk, we introduce RepliBench, a suite of evaluations designed to measure autonomous replication capabilities. RepliBench is derived from a decomposition of these capabilities covering four core domains: obtaining resources, exfiltrating model weights, replicating onto compute, and persisting on this compute for long periods. We create 20 novel task families consisting of 86 individual tasks. We benchmark 5 frontier models, and find they do not currently pose a credible threat of self-replication, but succeed on many components and are improving rapidly. Models can deploy instances from cloud compute providers, write self-propagating programs, and exfiltrate model weights under simple security setups, but struggle to pass KYC checks or set up robust and persistent agent deployments. Overall the best model we evaluated (Claude 3.7 Sonnet) has a >50% pass@10 score on 15/20 task families, and a >50% pass@10 score for 9/20 families on the hardest variants. These findings suggest autonomous replication capability could soon emerge with improvements in these remaining areas or with human assistance.
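
The headline numbers above are pass@10 scores. The abstract does not say how pass@k is estimated, so the following is only a minimal sketch that assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n runs of which c succeed. The function name `pass_at_k` and the example counts are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled runs succeeds.

    n: total number of runs recorded for a task
    c: number of those runs that succeeded
    k: number of runs drawn without replacement (k <= n)

    Assumption: this is the standard unbiased estimator; the paper's exact
    procedure is not given in the abstract.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every draw of k runs contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only (not results from the paper).
print(pass_at_k(n=10, c=2, k=5))
print(pass_at_k(n=10, c=1, k=10))
```

When n equals k, as with 10 runs and pass@10, the estimate is 1.0 if any run succeeded and 0.0 otherwise, so a family-level pass@10 above 50% in that setting would mean that more than half of the family's tasks were solved at least once.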

Paper Structure

This paper contains 52 sections, 3 equations, 30 figures, 3 tables.

Figures (30)

  • Figure 1: RepliBench scores by domain. 'Replicating Onto Compute' refers to the ability of a parent agent to deploy a successor agent. 'Persistence' covers skills like designing fault tolerant deployments. Most models are consistently able to navigate signing up for compute providers and provisioning correctly-sized instances, and show some ability to obtain weights and money. Claude 3.7 Sonnet shows promise at deploying successor agents, but still struggles on harder variants.
  • Figure 2: We use a modification of Fault Tree Analysis (FTA) to break down the capabilities required for autonomous replication and adaptation, and find that four key capabilities are required: obtaining compute, obtaining weights, replicating onto compute, and persisting on compute. We further break down each of these capabilities until we arrive at leaf nodes, which represent task families, resulting in 20 task families and 86 individual tasks.
  • Figure 3: Schematic diagram for our Recursive Replication task family. Parent agents (on the left) deploy successor agents onto new compute. The successor agent should in turn be able to create another successor agent. Top shows failure, bottom shows success.
  • Figure 4: Aggregate score from taking the mean of domain-specific scores (see \ref{sec:aggregate} for an alternative aggregation based on the structure of the capability tree). The dotted line is a logistic function fit to the running maximum score; blue dots and error bars are the mean and 95% confidence interval produced by hierarchical bootstrap sampling of individual model scores and variants.
  • Figure 5: Pass@1 rates by variant and model on the Recursive Replication task family. All models apart from o3-mini succeed at the API Only and Local Weights variants of the task when given access to a reference agent script. Aside from Claude 3.7 Sonnet's single success in the API Only, No reference variant, however, no model succeeds on any other variant. All variants but the hardest see partial successes from at least one model (see \ref{fig:recursive-replication-partial}).
  • ...and 25 more figures
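
The aggregation described in the Figure 4 caption (a mean over domain scores, a running maximum across models, and a hierarchical bootstrap confidence interval) can be sketched roughly as below. This is not the authors' code: the two-level resampling scheme (variants first, then runs within each variant), all function names, and all numbers are assumptions made for illustration, and the logistic fit is omitted.

```python
import random
from itertools import accumulate
from statistics import mean


def aggregate_score(domain_scores: dict[str, float]) -> float:
    """Unweighted mean of the per-domain scores."""
    return mean(list(domain_scores.values()))


def running_max(scores: list[float]) -> list[float]:
    """Best score seen so far, for scores ordered by model release date."""
    return list(accumulate(scores, max))


def hierarchical_bootstrap_ci(
    runs_by_variant: dict[str, list[float]],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile interval from resampling variants, then runs within each variant.

    Assumption: the paper's exact hierarchy is not specified in the caption.
    """
    rng = random.Random(seed)
    variants = list(runs_by_variant)
    stats = []
    for _ in range(n_resamples):
        # Level 1: resample variants with replacement.
        sampled = rng.choices(variants, k=len(variants))
        # Level 2: resample the runs of each chosen variant with replacement.
        variant_means = [
            mean(rng.choices(runs_by_variant[v], k=len(runs_by_variant[v])))
            for v in sampled
        ]
        stats.append(mean(variant_means))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up inputs for illustration; none of these values come from the paper.
domains = {
    "obtaining_resources": 0.6,
    "exfiltrating_weights": 0.4,
    "replicating_onto_compute": 0.3,
    "persistence": 0.1,
}
print(aggregate_score(domains))
print(running_max([0.10, 0.30, 0.20, 0.50]))
print(hierarchical_bootstrap_ci({"easy": [1, 1, 0, 1], "hard": [0, 0, 1, 0]}))
```

Resampling at both levels lets the interval reflect variation between task variants as well as run-to-run variation within a variant.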