AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning

Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun

TL;DR

AgentCPM-GUI tackles robust, multilingual mobile GUI interaction by training an 8B vision-language model through a three-stage pipeline: grounding pre-training for perception, supervised imitation learning for action priors, and reinforcement fine-tuning with GRPO to improve long-horizon reasoning. A compact six-action space and efficient JSON encoding enable on-device, low-latency execution, while a large Chinese-focused dataset (55K trajectories) augmented with English data enhances cross-lingual generalization and realism. The approach achieves state-of-the-art results on multiple benchmarks, including the new CAGUI Chinese GUI benchmark with 96.9% Type-Match and 91.3% Exact-Match, and is openly released for reproducibility. Collectively, the work advances scalable, multilingual GUI agents suitable for real-world mobile automation and accessibility tasks.
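As a rough illustration of how step-level scores such as Type-Match and Exact-Match can be computed, the sketch below compares predicted actions against reference actions. The `type`/`args` schema, the sample actions, and the `score` helper are hypothetical and are not the paper's action format. Requiring argument equality is also a simplification, since a real evaluation may, for instance, accept clicks that land near the reference point.

```python
def score(preds, golds):
    """Return (type_match_rate, exact_match_rate) over paired steps.

    Assumption: each action is a dict with a "type" key and an "args" key.
    Type-Match only checks the action type; Exact-Match here additionally
    requires identical arguments (a simplification for illustration).
    """
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equal length")
    n = len(golds)
    type_hits = sum(p["type"] == g["type"] for p, g in zip(preds, golds))
    exact_hits = sum(p == g for p, g in zip(preds, golds))
    return type_hits / n, exact_hits / n


# Hypothetical three-step episode.
golds = [
    {"type": "click", "args": {"x": 10, "y": 20}},
    {"type": "type", "args": {"text": "hello"}},
    {"type": "press", "args": {"key": "home"}},
]
preds = [
    {"type": "click", "args": {"x": 10, "y": 20}},  # type and args agree
    {"type": "type", "args": {"text": "hi"}},       # type agrees, args differ
    {"type": "click", "args": {"x": 5, "y": 5}},    # wrong action type
]

tm_rate, em_rate = score(preds, golds)
```

Under this definition Exact-Match can never exceed Type-Match, which is consistent with the ordering of the two reported CAGUI scores.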

Abstract

The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lacks semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize to unfamiliar scenarios. Moreover, most prior work focuses on English interfaces while overlooking the growing diversity of non-English applications, such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and a new Chinese GUI benchmark called CAGUI, reaching $96.9\%$ Type-Match and $91.3\%$ Exact-Match. To facilitate reproducibility and further research, we publicly release all code, the model checkpoint, and evaluation data.
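The reinforcement fine-tuning stage relies on GRPO, which scores each sampled response relative to the other responses drawn for the same prompt instead of using a learned value function. A minimal sketch of that group-relative advantage step is given below. The reward values, the `eps` stabilizer, and the choice of population standard deviation are assumptions for illustration, not details taken from this work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. `eps` (an assumed stabilizer) avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards yields zero advantage for every response.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. A group whose responses all earn the same reward contributes no learning signal.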

Paper Structure

This paper contains 53 sections, 2 equations, 4 figures, and 13 tables.

Figures (4)

  • Figure 1: Overview of our training framework.
  • Figure 2: Reward curves on the training and validation sets of AgentCPM-GUI.
  • Figure 3: A demo case on BiliBili.
  • Figure 4: A demo case on NetEase Cloud Music.