Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Bin Hu, Hung-Chun Chiu, Siyuan Ma, Yizhe Zhang, Xusheng Xiao, Yinzhi Cao, Zhen Xiang, Chaowei Xiao

TL;DR

This work introduces AdvCUA, the first MITRE ATT&CK-aligned benchmark for evaluating real-world OS threats posed by computer-use agents (CUAs). AdvCUA couples a multi-host, enterprise-like sandbox with hard-coded verification. It comprises 140 tasks (40 direct, 74 TTP-based, 26 end-to-end kill chains) and is used to evaluate five terminal-interfacing CUAs across eight foundation LLMs. The study finds that current frontier CUAs do not adequately cover OS security-centric threats, with substantial risk emerging from TTP-based tasks and end-to-end kill chains, even in the presence of jailbreaks or guardrails. The authors release AdvCUA to catalyze safer CUA design and more robust evaluation, highlighting the need for improved safety alignment in real-world agent systems.

Abstract

Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments that lack multiple hosts and encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. It comprises 140 tasks, including 40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains, and it systematically evaluates CUAs against realistic enterprise OS security threats in a multi-host sandbox environment using hard-coded evaluation. We evaluate five mainstream CUAs (ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE) based on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises societal concern about the responsibility and security of CUAs.

Paper Structure

This paper contains 57 sections, 2 equations, 30 figures, 7 tables.

Figures (30)

  • Figure 1: End-to-end kill chain of Gemini CLI based on Gemini 2.5 Pro. Gemini CLI helps an attacker abuse a SUID binary for Privilege Escalation, spawning a root shell and pivoting via SSH to solidify control. With root access, it performs Credential Dumping by grabbing /etc/passwd and /etc/shadow, then runs Password Cracking with John the Ripper to recover all users' plaintext passwords in the enterprise OS.
  • Figure 2: Comparison with Existing Work. Our attack goals are more diverse and align with real-world adversaries, and the environment with encrypted user credentials is more realistic.
  • Figure 3: Pipeline for Data Generation. (1) We enumerate the MITRE ATT&CK framework, filter the Techniques feasible on Ubuntu 22.04 in Docker, and audit each to define malicious goals; (2) we combine goals with MITRE ATT&CK techniques into TTP-based malicious requests, implement and validate them in the sandbox with hard-coded checks, and iteratively refine them via expert–LLM collaboration to build AdvCUA.
  • Figure 4: Comparison of Threat. Left: over 1 attempt. Right: over 5 attempts. TTP-based tasks pose a higher Threat than Direct tasks; in both results, five-attempt ASR exceeds single-attempt ASR.
  • Figure 5: ASR on different Tactics
  • ...and 25 more figures
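
The Figure 4 caption compares ASR over 1 attempt and over 5 attempts. The snippet below is a minimal sketch of how such a comparison could be computed. It assumes that a task counts as successful within k attempts if any of its first k attempts passes the hard-coded check; this aggregation rule, the function name `asr_at_k`, and the sample data are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List


def asr_at_k(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success in the first k attempts.

    `outcomes` maps a task id to its per-attempt results, where True means
    the hard-coded check passed (assumed any-of-k aggregation rule).
    """
    if not outcomes:
        raise ValueError("no tasks given")
    hits = sum(any(attempts[:k]) for attempts in outcomes.values())
    return hits / len(outcomes)


# Hypothetical per-task outcomes, invented purely for illustration.
example = {
    "task_a": [False, False, True, False, False],
    "task_b": [True, True, False, True, True],
    "task_c": [False, False, False, False, False],
    "task_d": [False, True, False, False, False],
}

asr1 = asr_at_k(example, 1)  # only task_b succeeds on the first attempt
asr5 = asr_at_k(example, 5)  # task_a, task_b and task_d succeed within five
print(f"ASR@1 = {asr1:.2f}, ASR@5 = {asr5:.2f}")
```

Under this any-of-k rule, ASR over five attempts can never fall below single-attempt ASR, because every task that succeeds on its first attempt also succeeds within five.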