D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

Hongze Mi, Yibo Feng, Wenjie Lu, Yuqi Wang, Jinyuan Li, Song Cao, He Cui, Tengfei Tian, Xuelin Zhang, Haotian Luo, Di Sun, Naiqiang Tan, Gang Pan

TL;DR

D-Artemis addresses three key challenges in mobile GUI agents (data bottlenecks, delayed error detection, and conflicting guidance) by embedding a deliberative cognitive loop that combines app-specific tip retrieval with proactive pre-execution alignment (the TAC Check module and the ACA) and post-execution reflection (the SRA). The framework enables general-purpose multimodal LLMs to perform GUI tasks with strong generalization, achieving SOTA results on AndroidWorld (75.8% success rate) and ScreenSpot-V2 (96.8%) without training on GUI trajectories. Ablations show that each component contributes significantly: TAC and ACA jointly reduce errors, TAC acts as an effective error-filtering layer, and tip retrieval reduces guidance noise. The results underline the practicality of data-efficient, robust GUI automation and offer a blueprint for extending deliberative, multi-agent reasoning to other task domains.

Abstract

Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advancements, current approaches are hindered by several critical challenges: a data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D-Artemis, a novel deliberative framework. D-Artemis leverages a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, where the Thought-Action Consistency (TAC) Check module and the Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) for GUI tasks without the need for training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results on two major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.

Paper Structure

This paper contains 37 sections, 6 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: D-Artemis framework emulates the human cognitive loop of learning, planning, calibration, and reflection.
  • Figure 2: Overview of the D-Artemis framework. (a) The manager agent is guided by two input modalities: textual (task, tips, working memory) and visual (screenshot only). (b) Pre-execution, the TAC Check module verifies thought-action consistency. (c) A low consistency score triggers the Action Correction Agent (ACA) to analyze the error type and rectify the action. (d) Post-execution, the Status Reflection Agent (SRA) assesses the action's effectiveness and the environmental state to produce guidance for the next step. Upon completion of each step, the working memory is updated.
  • Figure 3: Ablation study on AndroidWorld.
  • Figure 4: Success rates of different tip guidance strategies across AndroidWorld applications.
  • Figure 5: Statistics of error cases on AndroidWorld tasks that D-Artemis failed to complete. Qwen2.5-VL-72B is used as both the baseline model and the foundational LLM backbone for D-Artemis. Methods marked with "$\dag$" use GUI-OWL-32B as the backbone.
  • ...and 16 more figures
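
The per-step control flow described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: every name here (run_step, WorkingMemory, the toy_* functions), the use of plain strings for screenshots and actions, and the 0.5 consistency threshold are assumptions made for the example. The source says only that a low consistency score triggers the ACA.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    thought: str
    action: str
    consistency: float
    corrected: bool
    reflection: str


@dataclass
class WorkingMemory:
    steps: List[Step] = field(default_factory=list)


def run_step(task, tips, screenshot, memory,
             manager, tac_score, correct, execute, reflect,
             threshold=0.5):
    """One deliberative step: think, align, act, reflect, remember.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    # (a) Manager proposes a thought and an action from text + screenshot.
    thought, action = manager(task, tips, memory, screenshot)
    # (b) Pre-execution: score thought-action consistency.
    score = tac_score(thought, action, screenshot)
    # (c) A low score hands the action to the correction agent.
    corrected = score < threshold
    if corrected:
        action = correct(thought, action, screenshot)
    # Execute the (possibly corrected) action in the environment.
    new_screenshot = execute(action)
    # (d) Post-execution: reflect on the effect to guide the next step.
    reflection = reflect(thought, action, screenshot, new_screenshot)
    # Update working memory once the step is complete.
    memory.steps.append(Step(thought, action, score, corrected, reflection))
    return new_screenshot


# Toy stand-ins for the model-backed components (all hypothetical).
def toy_manager(task, tips, memory, screenshot):
    return ("tap the Save button", "click(10, 10)")


def toy_tac(thought, action, screenshot):
    # Pretend (10, 10) does not land on the Save button.
    return 0.2 if action == "click(10, 10)" else 0.9


def toy_aca(thought, action, screenshot):
    return "click(120, 480)"


def toy_execute(action):
    return f"screen after {action}"


def toy_sra(thought, action, before, after):
    return "action succeeded" if after != before else "no visible change"


memory = WorkingMemory()
next_screen = run_step("save the note", ["Save is at the bottom"], "screen 0",
                       memory, toy_manager, toy_tac, toy_aca,
                       toy_execute, toy_sra)
```

In this toy run the manager's first action scores below the assumed threshold, so the corrected action is the one executed and recorded in working memory. Passing the components as plain callables keeps the loop independent of any particular model backend.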