SecAgent: Efficient Mobile GUI Agent with Semantic Context

Yiping Xie, Song Chen, Jingxuan Xing, Wei Jiang, Zekun Zhu, Yingyao Wang, Pi Bu, Jun Song, Yuning Jiang, Bo Zheng

TL;DR

This work constructs a human-verified Chinese mobile GUI dataset and proposes a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information.

Abstract

Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the scarcity of high-quality multilingual datasets, particularly for non-English ecosystems, and inefficient history representation methods. To address these challenges, we present SecAgent, an efficient mobile GUI agent at 3B scale. We first construct a human-verified Chinese mobile GUI dataset with 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark featuring multi-choice action annotations. Building upon this dataset, we propose a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information. Through supervised and reinforcement fine-tuning, SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on both our benchmark and public navigation benchmarks. We will open-source the training dataset, benchmark, model, and code to advance research in multilingual mobile GUI automation.

Paper Structure (23 sections, 6 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Performance comparison of various agent models on three representative navigation benchmarks. SecAgent-3B achieves superior performance to 3B models and comparable performance to 7B-8B models.
  • Figure 2: Collection pipeline of our CMGUI dataset. Grounding data: UI elements are selected from random walk episodes, with MLLMs generating instructions. Navigation data: episodes are collected using a hybrid human-agent strategy, then humans annotate action correctness and bounding boxes, while MLLMs annotate semantic context and thoughts.
  • Figure 3: a) Samples from the GUIOdyssey dataset [lu2024gui], where SAM2 [ravi2024sam] is used to segment UI elements to obtain bounding boxes, inevitably leading to noise and errors (e.g., the red box is incorrectly sized). b) Samples from our CMGUI data. We employ well-trained annotators to conduct box-by-box annotation and implement a double-check mechanism to ensure precise bounding boxes. c) For our CMGUI-Bench, we conduct step-by-step and box-by-box annotation and consider all reasonable actions at each step, producing multi-choice action annotations.
  • Figure 4: Comparison of different methods. Left: a) Baseline: takes the instruction and the current screenshot as input and directly outputs the action. b) AgentCPM [zhang2025agentcpm] and UI-R1 [lu2025ui] expect the model to have reasoning capabilities and thus introduce a thought before the output action. c) UI-Venus [qin2025ui], GUI-R1 [luo2025gui], and OS-Atlas [wu2024atlas] add history actions to the input. d) OdysseyAgent [lu2024gui] uses both history actions and screenshots in the input. To ensure optimal results, it uses a large number of historical screenshots, for example, $N=5$. e) Our SecAgent records history information by maintaining a concise, natural-language summary. As a result, SecAgent can achieve good results using only a single historical image and action. Right: A demonstration of SecAgent.
  • Figure 5: The training framework of SecAgent. We observe that the original grounding and navigation datasets exhibit inherent biases. To mitigate this issue, we do not directly utilize the grounding data for SFT training as in prior work [zhang2025agentcpm, zhang2025tongui]. Instead, we transform the grounding annotations into a navigation-compatible format by treating the center point of each bounding box as the target click coordinate.
  • ...and 4 more figures
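
The Figure 5 caption describes converting grounding annotations into a navigation-compatible format by using the center of each bounding box as the click target. A minimal sketch of that conversion is shown below. The record layout (`GroundingSample`, the `(x_min, y_min, x_max, y_max)` box order, and the output dictionary keys) is a hypothetical illustration, since the dataset schema is not given here; only the center-point rule comes from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingSample:
    """Hypothetical grounding record: an instruction plus a target box."""
    instruction: str
    # Assumed box order: (x_min, y_min, x_max, y_max) in pixels.
    bbox: tuple[float, float, float, float]


def grounding_to_navigation(sample: GroundingSample) -> dict:
    """Turn a grounding sample into a single-step click action.

    The click coordinate is the center point of the bounding box,
    as described in the Figure 5 caption. The output keys are
    illustrative assumptions, not the paper's actual format.
    """
    x_min, y_min, x_max, y_max = sample.bbox
    if x_max < x_min or y_max < y_min:
        raise ValueError(f"malformed bounding box: {sample.bbox}")
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2
    return {
        "instruction": sample.instruction,
        "action": {"type": "click", "x": center_x, "y": center_y},
    }


example = GroundingSample(
    instruction="Tap the search button",
    bbox=(100, 200, 300, 260),
)
step = grounding_to_navigation(example)
print(step["action"])
```

Under these assumptions, a box spanning x from 100 to 300 and y from 200 to 260 maps to a click at (200.0, 230.0), so grounding samples can be mixed with navigation steps using one shared action format.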