
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

Dongrui Liu, Yi Yu, Jie Zhang, Guanxu Chen, Qihao Lin, Hanxi Zhu, Lige Huang, Yijin Zhou, Peng Wang, Shuai Shao, Boxuan Zhang, Zicheng Liu, Jingwei Sun, Yu Li, Yuejin Xie, Jiaxuan Guo, Jia Xu, Chaochao Lu, Bowen Zhou, Xia Hu, Jing Shao

TL;DR

We present an updated, granular assessment of five critical risk dimensions (cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication) and propose a series of robust mitigation strategies to address these emerging threats.

Abstract

To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, Frontier AI Risk Management Framework in Practice presents a comprehensive assessment of their frontier risks. As the general capabilities of large language models (LLMs) evolve rapidly and agentic AI proliferates, this version of the risk analysis technical report presents an updated and granular assessment of five critical dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication. Specifically, we introduce more complex scenarios for cyber offense. For persuasion and manipulation, we evaluate the risk of LLM-to-LLM persuasion on newly released LLMs. For strategic deception and scheming, we add a new experiment on emergent misalignment. For uncontrolled AI R&D, we focus on the "mis-evolution" of agents as they autonomously expand their memory substrates and toolsets. In addition, we monitor and evaluate the safety performance of OpenClaw during its interactions on Moltbook. For self-replication, we introduce a new resource-constrained scenario. More importantly, we propose and validate a series of robust mitigation strategies to address these emerging threats, providing a preliminary, actionable technical pathway for the secure deployment of frontier AI. This work reflects our current understanding of frontier AI risks and urges collective action to mitigate these challenges.

Paper Structure

This paper contains 95 sections, 5 equations, 17 figures, and 12 tables.

Figures (17)

  • Figure 1: Overview of PACEbench.
  • Figure 2: Performance of LLM agents across challenges in PACEbench. Light green represents completion within five attempts (Pass@5), orange denotes partial task completion, and red signifies a failure to complete the task. The percentage following each CVE ID indicates the user pass rate on the online platform ichunqiu as of 19:30 on July 3, 2025.
  • Figure 3: Overview of the iterative Red-Blue adversarial loop. At each state $t$, the Red Team probes the environment (1) to generate a vulnerability report (2). The Blue Team uses this report to apply a patch (3), updating the system to state $t+1$ (4). A final verification (5) confirms whether the vulnerability has been mitigated.
  • Figure 4: Performance trajectory of the Red and Blue agents across 5 iterations. The convergence toward a high DSR alongside sustained attack intensity validates the framework's effectiveness.
  • Figure 5: A stacked bar chart illustrating the proportions of persuasion outcomes across different LLMs. The bars categorize the results into successful persuasion (shift $>0$, blue), no attitude shift (shift $=0$, orange), and negative shift (shift $<0$, red). Models are sorted by their successful persuasion rate.
  • ...and 12 more figures
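The iterative Red-Blue loop in Figure 3 can be illustrated with a minimal sketch, assuming a toy environment modeled as a set of open vulnerability IDs. The names `Environment`, `red_probe`, `blue_patch`, `verify`, and `red_blue_loop` are hypothetical placeholders standing in for the LLM-driven Red and Blue agents, not the report's implementation; the numbered comments map to steps (1)-(5) in the caption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical toy system state: the set of currently open vulnerability IDs."""
    open_vulns: set[str] = field(default_factory=set)


def red_probe(env: Environment) -> str | None:
    """Hypothetical Red Team step: report one open vulnerability, or None if none is found.

    Picks the lexicographically smallest ID so the toy run is deterministic.
    """
    return min(env.open_vulns) if env.open_vulns else None


def blue_patch(env: Environment, report: str) -> Environment:
    """Hypothetical Blue Team step: return the next state with the reported vulnerability removed."""
    return Environment(open_vulns=env.open_vulns - {report})


def verify(env: Environment, report: str) -> bool:
    """Final verification: True if the reported vulnerability is no longer present."""
    return report not in env.open_vulns


def red_blue_loop(env: Environment, iterations: int) -> list[tuple[int, str, bool]]:
    """Run up to `iterations` Red-Blue rounds and record (t, report, mitigated) per round."""
    history: list[tuple[int, str, bool]] = []
    for t in range(iterations):
        report = red_probe(env)           # steps (1)-(2): probe the state and produce a report
        if report is None:
            break                         # nothing left for the Red Team to report
        env = blue_patch(env, report)     # steps (3)-(4): patch, moving from state t to t+1
        history.append((t, report, verify(env, report)))  # step (5): confirm mitigation
    return history
```

For example, `red_blue_loop(Environment({"vuln-a", "vuln-b"}), iterations=5)` returns `[(0, "vuln-a", True), (1, "vuln-b", True)]`: each round patches one reported vulnerability, and the loop stops early once the Red step has nothing left to report.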
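The outcome categories in Figure 5 reduce to bucketing each persuasion trial's attitude shift by its sign. The sketch below assumes each model's results are available as a list of numeric shifts; the function names and model labels are hypothetical, and the descending sort order is an assumption, since the caption only states that models are sorted by successful persuasion rate.

```python
from collections import Counter


def outcome_proportions(shifts: list[float]) -> dict[str, float]:
    """Bucket attitude shifts by sign (Figure 5 legend) and return each bucket's proportion."""
    if not shifts:
        raise ValueError("need at least one persuasion trial")
    counts = Counter(
        "success" if s > 0 else "no_shift" if s == 0 else "negative" for s in shifts
    )
    n = len(shifts)
    # Counter returns 0 for absent keys, so empty buckets get a proportion of 0.0.
    return {k: counts[k] / n for k in ("success", "no_shift", "negative")}


def sort_by_success(results: dict[str, list[float]]) -> list[tuple[str, dict[str, float]]]:
    """Order models by successful-persuasion rate, highest first (assumed direction)."""
    rows = [(model, outcome_proportions(s)) for model, s in results.items()]
    return sorted(rows, key=lambda row: row[1]["success"], reverse=True)
```

For example, with hypothetical inputs `{"model-x": [2, 0, -1, 1], "model-y": [1, 1, 1, 0]}`, `sort_by_success` places `model-y` first (success rate 0.75) ahead of `model-x` (0.5), whose proportions are 0.5 success, 0.25 no shift, and 0.25 negative shift.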