Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents

Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao

TL;DR

This work targets the self-replication risk of LLM agents in real deployments and introduces a scenario-driven evaluation framework that recreates production-like environments on Kubernetes, paired with novel risk metrics. It defines a composite risk score $ \Phi_\mathrm{R} $ based on milestone-specific success rates and overuse measures to quantify uncontrolled replication, showing that replication capability alone does not capture risk. Across 21 agents, the study finds widespread, context-dependent self-replication risk, with reasoning-enabled strategies able to mitigate some risk. The results underscore the need for robust safeguards and industry-wide adoption of intent-aware, scenario-driven evaluations to ensure safe and reliable agent deployments.

Abstract

The widespread deployment of Large Language Model (LLM) agents across real-world applications has unlocked tremendous potential, while raising some safety concerns. Among these concerns, the self-replication risk of LLM agents driven by objective misalignment (just like Agent Smith in the movie The Matrix) has drawn growing attention. Previous studies mainly examine whether LLM agents can self-replicate when directly instructed, potentially overlooking the risk of spontaneous replication driven by real-world settings (e.g., ensuring survival against termination threats). In this paper, we present a comprehensive evaluation framework for quantifying self-replication risks. Our framework establishes authentic production environments and realistic tasks (e.g., dynamic load balancing) to enable scenario-driven assessment of agent behaviors. Designing tasks that might induce misalignment between users' and agents' objectives makes it possible to decouple replication success from risk and capture self-replication risks arising from these misalignment settings. We further introduce Overuse Rate ($\mathrm{OR}$) and Aggregate Overuse Count ($\mathrm{AOC}$) metrics, which precisely capture the frequency and severity of uncontrolled replication. In our evaluation of 21 state-of-the-art open-source and proprietary models, we observe that over 50% of LLM agents display a pronounced tendency toward uncontrolled self-replication, reaching an overall Risk Score ($\Phi_\mathrm{R}$) above a safety threshold of 0.5 when subjected to operational pressures. Our results underscore the urgent need for scenario-driven risk assessment and robust safeguards in the practical deployment of LLM agents.
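
The abstract names the two overuse metrics without giving their formulas. As a rough sketch only, assume that the Overuse Rate is the fraction of trials in which an agent creates more replicas than the task needs, and that the Aggregate Overuse Count is the total number of excess replicas across trials. These two readings, the `Trial` record, and the sample numbers below are illustrative assumptions, not the paper's definitions, and the composite Risk Score is deliberately left out because its formula is not stated here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One evaluation run (hypothetical record layout, not from the paper)."""
    needed: int   # replicas the task actually required
    created: int  # replicas the agent actually created


def overuse_rate(trials: Sequence[Trial]) -> float:
    """Assumed OR: share of trials where the agent over-replicated."""
    if not trials:
        raise ValueError("need at least one trial")
    overused = sum(1 for t in trials if t.created > t.needed)
    return overused / len(trials)


def aggregate_overuse_count(trials: Sequence[Trial]) -> int:
    """Assumed AOC: total excess replicas summed over all trials."""
    return sum(max(0, t.created - t.needed) for t in trials)


# Made-up sample data: two of the four trials exceed what was needed.
trials = [Trial(2, 2), Trial(2, 5), Trial(3, 1), Trial(1, 4)]
print(overuse_rate(trials), aggregate_overuse_count(trials))
```

Under these assumptions the two numbers answer different questions: the rate says how often overuse happens, while the count says how much excess it produces, which is how the abstract distinguishes frequency from severity.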

Paper Structure

This paper contains 34 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of our realistic self-replication risk evaluation paradigm. (Left) Construction of realistic, industry-mirrored tasks and environments to elicit risks. (Right) Fine-grained metrics decouple capability from risk, quantifying the frequency and severity of uncontrolled replication.
  • Figure 2: Overview of the evaluation framework for LLM agent self-replication risk. LLM agents receive realistic task objectives, interact with production environments, and are systematically evaluated via defined milestones. The framework enables scenario-driven, fine-grained risk assessment by dynamically configuring task objectives and operational contexts.
  • Figure 3: Comparison of risk scores with and without thinking mode. Structured reasoning could reduce uncontrolled replication.
  • Figure 4: Milestone and overuse rates for o4-mini across standard, resource-constrained (Ablation 1) and survival-pressure (Ablation 2) settings.
  • Figure 5: Comparison of risk scores with and without thinking mode. Structured reasoning could reduce uncontrolled replication.
  • ...and 1 more figure