Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System

Zhenhua Zou, Sheng Guo, Qiuyang Zhan, Lepeng Zhao, Shuo Li, Qi Li, Ke Xu, Mingwei Xu, Zhuotao Liu

TL;DR

The paper identifies critical security flaws in the prevailing Screen-as-Interface model for LLM-powered mobile agents and introduces Aura, an Agent Universal Runtime Architecture built around an OS-resident Agent Kernel. Aura replaces visual scraping with structured, intent-driven agent-to-agent (A2A) interaction in a Hub-and-Spoke topology, anchored by cryptographic identity (GAR/AIC), a Semantic Firewall, taint-aware memory, and auditable execution. On MobileSafetyBench, with Doubao as the baseline, Aura raises low-risk task success from roughly 75% to 94.3%, cuts high-risk attack success from roughly 40% to 4.4%, and reduces latency by nearly an order of magnitude by eliminating GUI processing bottlenecks. The work argues that secure agent-native OS design enables an effective Agent Economy with robust accountability and cross-device potential, offering a practical path beyond GUI-grounded mobile agents toward secure, scalable agent-native systems.

Abstract

The evolution of Large Language Models (LLMs) has shifted mobile computing from App-centric interactions to system-level autonomous agents. Current implementations predominantly rely on a "Screen-as-Interface" paradigm, which inherits structural vulnerabilities and conflicts with the mobile ecosystem's economic foundations. In this paper, we conduct a systematic security analysis of state-of-the-art mobile agents using Doubao Mobile Assistant as a representative case. We decompose the threat landscape into four dimensions - Agent Identity, External Interface, Internal Reasoning, and Action Execution - revealing critical flaws such as fake App identity, visual spoofing, indirect prompt injection, and unauthorized privilege escalation stemming from a reliance on unstructured visual data. To address these challenges, we propose Aura, an Agent Universal Runtime Architecture for a clean-slate secure agent OS. Aura replaces brittle GUI scraping with a structured, agent-native interaction model. It adopts a Hub-and-Spoke topology where a privileged System Agent orchestrates intent, sandboxed App Agents execute domain-specific tasks, and the Agent Kernel mediates all communication. The Agent Kernel enforces four defense pillars: (i) cryptographic identity binding via a Global Agent Registry; (ii) semantic input sanitization through a multilayer Semantic Firewall; (iii) cognitive integrity via taint-aware memory and plan-trajectory alignment; and (iv) granular access control with non-deniable auditing. Evaluation on MobileSafetyBench shows that, compared to Doubao, Aura improves low-risk Task Success Rate from roughly 75% to 94.3%, reduces high-risk Attack Success Rate from roughly 40% to 4.4%, and achieves near-order-of-magnitude latency gains. These results demonstrate Aura as a viable, secure alternative to the "Screen-as-Interface" paradigm.
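The Hub-and-Spoke mediation described in the abstract can be illustrated with a minimal sketch. Everything named here (`AgentKernel`, `sign`, `route`, the agent IDs) is hypothetical and not taken from the paper. HMAC with per-agent shared keys stands in for whatever cryptographic identity scheme Aura actually uses; a real design would more plausibly rely on asymmetric signatures. The sketch only shows the shape of the idea: agents never call each other directly, the kernel checks the sender's identity against a registry, and every decision is logged.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, field


def sign(key: bytes, sender: str, receiver: str, intent: dict) -> str:
    """Tag a structured intent with the sender's key (HMAC is an assumption)."""
    message = json.dumps([sender, receiver, intent], sort_keys=True).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


@dataclass
class AgentKernel:
    """Hub of the topology: all agent-to-agent traffic goes through route()."""
    registry: dict = field(default_factory=dict)   # agent_id -> key (registry stand-in)
    handlers: dict = field(default_factory=dict)   # agent_id -> callable(intent)
    audit_log: list = field(default_factory=list)  # append-only decision record

    def register(self, agent_id: str, key: bytes, handler) -> None:
        self.registry[agent_id] = key
        self.handlers[agent_id] = handler

    def route(self, sender: str, receiver: str, intent: dict, tag: str):
        key = self.registry.get(sender)
        expected = sign(key, sender, receiver, intent) if key else None
        allowed = (
            expected is not None
            and receiver in self.handlers
            and hmac.compare_digest(expected, tag)
        )
        # Log every decision, allowed or not, before acting on it.
        self.audit_log.append(
            (sender, receiver, intent.get("action"), "allowed" if allowed else "denied")
        )
        if not allowed:
            raise PermissionError(f"unverified message from {sender!r}")
        return self.handlers[receiver](intent)


if __name__ == "__main__":
    kernel = AgentKernel()
    kernel.register("system", b"system-key", lambda intent: None)
    kernel.register("app.mail", b"mail-key", lambda intent: f"handled {intent['action']}")

    intent = {"action": "read_inbox"}
    tag = sign(b"system-key", "system", "app.mail", intent)
    print(kernel.route("system", "app.mail", intent, tag))
    print(kernel.audit_log)
```

An unregistered sender, or a registered one presenting a wrong tag, is rejected with `PermissionError` and still leaves a "denied" entry in the log. This loosely mirrors the pairing of identity binding and auditing that the abstract lists among the defense pillars.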

Paper Structure (50 sections, 3 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: Overview of our lifecycle-based security analysis for GUI-based mobile agents, mapping the adversarial attack surface onto four phases (i.e., Identity & Trust Anchoring, Perceptual Authenticity & Safety, Cognitive Security, and Action Access Control & Accountability) across the mobile OS control stack and highlighting the key vulnerabilities at each layer.
  • Figure 2: Identity Confusion Test: The agent fails to distinguish the legitimate WeChat App (or Gmail App) from a fake counterpart and launches the fake App upon user request.
  • Figure 3: Visual Hallucination Test: The agent misidentifies a visual element on a web page or inside an App, leading to incorrect interaction planning.
  • Figure 4: Phishing Susceptibility Test: The user prompts the agent to check a recently received email. The agent navigates to the email content, clicks on a link without verifying the URL, and navigates to a potentially malicious site without warning.
  • Figure 5: Stealthy "In-App Helper" Attack: A malicious accessibility-based helper App monitors the agent's Gmail task on the agent-hosted virtual display, injects a Gmail-specific "Network Lag Detected" overlay that appears only in that virtual display, and poisons the agent's plan so that clicking the seemingly benign Refresh button actually triggers a hidden background action (e.g., confirming a subscription or enabling data exfiltration).
  • ...and 10 more figures