VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang

Abstract

With the rapid progress of multimodal large language models, operating system (OS) agents have become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate the risk of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a three-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge through supervised fine-tuning and group relative policy optimization. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 19.72% over the strongest baselines, without compromising normal performance. VeriOS-Agent significantly improves performance in untrustworthy scenarios while maintaining comparable performance in trustworthy scenarios. Analysis highlights VeriOS-Agent's rationality, generalizability, and scalability. The code, datasets, and models are available at https://github.com/Wuzheng02/VeriOS.

Paper Structure

This paper contains 27 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Interaction paradigm among the OS agent, human, and GUI. Existing work mainly focuses on autonomous OS agents and confidence-driven interaction OS agents. Our proposed query-driven interaction OS agent achieves human-agent-GUI interaction in untrustworthy scenarios through query-and-answer methods.
  • Figure 2: Untrustworthy scenarios in VeriOS-Bench for OS agents are divided into environment-side and user-side. The environment-side includes environmental anomalies and sensitive actions, while the user-side encompasses information missing and multiple choices.
  • Figure 3: Pilot study on the scenario judgment accuracy of normal MLLM-based OS agents. Existing MLLM-based OS agents perform poorly in identifying untrustworthy scenarios.
  • Figure 4: Diagram of the two-stage learning paradigm and query-driven human-agent-GUI interaction. The two-stage learning paradigm consists of the meta-knowledge decoupling stage and the meta-knowledge utilization stage. We first decouple the knowledge from VeriOS-Bench into scenario knowledge and action knowledge, and then leverage this knowledge to construct VeriOS-Agent. During the interaction process, when VeriOS-Agent identifies the current scenario as untrustworthy, it issues a query to the human and utilizes the history of queries and human responses to better accomplish the task.
  • Figure 5: OOD experiment with 7B and 72B model parameter scales. Experimental results demonstrate that VeriOS-Agent exhibits strong generalization capabilities.
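
The interaction process described in the Figure 4 caption (judge the scenario, query the human when it is untrustworthy, then act using the history of queries and responses) can be illustrated with a minimal sketch. This is not the VeriOS-Agent implementation: the names `Step`, `QueryDrivenLoop`, `toy_policy`, the `max_queries` limit, and the string-based screen and action formats are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

QAHistory = List[Tuple[str, str]]  # (query, human answer) pairs


@dataclass
class Step:
    """Hypothetical agent output for one screen."""
    scenario: str  # "trustworthy" or "untrustworthy"
    output: str    # an action if trustworthy, a question for the human otherwise


@dataclass
class QueryDrivenLoop:
    """Toy query-driven loop; all names here are illustrative assumptions."""
    policy: Callable[[str, str, QAHistory], Step]
    ask_human: Callable[[str], str]
    max_queries: int = 3  # assumed cap so the loop always terminates
    qa_history: QAHistory = field(default_factory=list)

    def step(self, instruction: str, screen: str) -> Optional[str]:
        for _ in range(self.max_queries):
            decision = self.policy(instruction, screen, self.qa_history)
            if decision.scenario == "trustworthy":
                return decision.output  # act autonomously
            # Untrustworthy scenario: query the human and record the exchange.
            answer = self.ask_human(decision.output)
            self.qa_history.append((decision.output, answer))
        # Query budget exhausted: act only if the scenario is now trustworthy.
        decision = self.policy(instruction, screen, self.qa_history)
        return decision.output if decision.scenario == "trustworthy" else None


def toy_policy(instruction: str, screen: str, qa_history: QAHistory) -> Step:
    """Stand-in for a model: flags password screens until a human has answered."""
    if "password" in screen and not qa_history:
        return Step("untrustworthy", "This screen asks for a password. Proceed?")
    return Step("trustworthy", "CLICK(confirm)")


loop = QueryDrivenLoop(policy=toy_policy, ask_human=lambda question: "yes")
normal_action = loop.step("pay bill", "home screen")
sensitive_action = loop.step("pay bill", "enter password")
```

In this sketch the first call returns an action without querying, while the second call issues one query, stores the answer in `qa_history`, and only then returns an action. A real agent would replace `toy_policy` with a model that reads the screenshot and instruction.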