Mano Technical Report

Tianyu Fu; Anyang Su; Chenxu Zhao; Hanning Wang; Minghui Wu; Zhe Yu; Fei Hu; Mingjia Shi; Wei Dong; Jiayao Wang; Yuyang Chen; Ruiyang Yu; Siran Peng; Menglin Li; Nan Huang; Haitian Wei; Jiawei Yu; Yi Xin; Xilin Zhao; Kai Gu; Ping Jiang; Sifan Zhou; Shuo Wang

TL;DR

Mano tackles robust, end-to-end GUI automation by combining a multimodal foundation model with a simulated data environment and a three-stage training pipeline (SFT, offline RL with GRPO, online RL). The framework is augmented by Mano-parking for autonomous data extraction, Mano-verify for error checking, and Mano-cipher for authentication. Empirical results on Mind2Web and OSWorld demonstrate state-of-the-art performance, and ablations validate the contributions of online RL, historical context, and closed-loop data cycling. The work highlights the importance of domain-specific data, iterative refinement through RL, and holistic reward design for closing the gaps that keep vision-language models from serving as practical GUI agents, with potential for real-world deployment and further improvements in data acquisition and verification tooling.

Abstract

Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
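
The TL;DR names GRPO as the algorithm for the offline RL stage. As a rough illustration of the core idea behind GRPO in general (not Mano's actual implementation, which this summary does not describe), the sketch below computes group-relative advantages: several candidate outputs are sampled for the same task, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds the scalar rewards of several candidate outputs sampled
    for the same task. `eps` (an assumed constant) avoids division by zero
    when every reward in the group is identical.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled action sequences on one GUI task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs that score above their group's average receive a positive advantage and those below receive a negative one, so no separately learned value function is needed as a baseline.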
