Mano Technical Report

Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang

TL;DR

Mano tackles robust, end-to-end GUI automation by integrating a multimodal foundation model with a simulated data environment and a three-stage training pipeline (SFT, offline RL with GRPO, online RL). The framework is augmented by Mano-parking for autonomous data extraction, Mano-verify for error checking, and Mano-cipher, a specialized module for authentication. Empirical results on Mind2Web and OSWorld demonstrate state-of-the-art performance, and ablations validate the contributions of online RL, historical context, and closed-loop data cycling. The work highlights the importance of domain-specific data, iterative refinement through RL, and holistic reward design for closing the gaps that vision-language models show as practical GUI agents. It also points to real-world deployment and to further improvements in data acquisition and verification tooling.

Abstract

Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.

Paper Structure

This paper contains 26 sections, 7 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of the Mano framework. The left part illustrates the Exploration Module, which operates in simulated browser and desktop environments to collect interaction elements and candidate goals, generating diverse trajectories and login assistance data for training. The center shows the Inference Process Pipeline, where the model follows a structured “think–act–verify” loop: interpreting GUI states, producing action descriptions (e.g., click or type), executing them, and validating outcomes through a verifier. The right part depicts the Optimize Process, a progressive pipeline of SFT, offline RL, and online RL, which systematically strengthens reasoning, adaptability, and end-to-end decision-making in dynamic GUI environments.
  • Figure 2: Overall fine-tuning framework of Mano for GUI-oriented tasks. The pipeline consists of three progressive stages: (i) SFT on offline demonstrations; (ii) Offline RL leveraging static trajectories with reward decomposition; and (iii) Online RL with active environment interaction. The system incorporates step-level reasoning, explicit action description, and operation type selection (e.g., click, drag, type, scroll), while final performance is evaluated through structured outputs and multi-dimensional rewards combining format accuracy, operation correctness, and task completion.
  • Figure 3: Overall framework of online RL in Mano. The Mano model interacts with multiple parallel Playwright instances, each representing a GUI environment. For every step, the model fetches the status and screenshot, performs inference to generate thought and action, and then executes the action within the corresponding environment. The loop continues until the task is completed, while memory traces are recorded and trajectories are exported for further training and analysis.
  • Figure 4: The operational workflow of Mano-parking, which illustrates its autonomous data extraction pipeline. The process begins with request reception and function registry lookup, followed by either direct execution of pre-validated functions or initiation of a multi-phase extraction synthesis. In the latter case, simplified HTML structures are obtained through browser automation and cleaning algorithms, combined with user-defined attribute specifications to generate customized extraction functions. These functions undergo a three-tier validation—field completeness, semantic consistency, and structural integrity—before being executed and stored for reuse. Furthermore, Mano-parking incorporates continuous monitoring and a self-healing mechanism, enabling adaptive regeneration of extraction logic when website structures evolve. This design ensures robustness, efficiency, and minimal human intervention across diverse web environments.
  • Figure 5: Mano-cipher, a specialized GUI model for authentication. It automates login operations across diverse systems by handling various captcha types, including alphanumeric, image-based sliding, rotation, content recognition, and logical reasoning challenges. Upon successful verification, control is returned to Mano for subsequent tasks.
  • ...and 4 more figures
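
The "think–act–verify" loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Step`, `run_episode`, and `ToyEnv`, the callback signatures, and the string-based state are all hypothetical and chosen only to make the control flow concrete. A real agent would observe screenshots and call a vision-language model where this sketch uses a fixed toy policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    thought: str    # the model's reasoning for this step
    action: str     # action description, e.g. "click" or "type"
    verified: bool  # whether the verifier accepted the outcome


def run_episode(
    observe: Callable[[], str],
    policy: Callable[[str, List[Step]], Tuple[str, str]],
    execute: Callable[[str], None],
    verify: Callable[[str, str, str], bool],
    is_done: Callable[[str], bool],
    max_steps: int = 10,
) -> List[Step]:
    """Run one think-act-verify episode and return the recorded history."""
    history: List[Step] = []
    for _ in range(max_steps):
        state = observe()                          # fetch the current GUI state
        if is_done(state):
            break
        thought, action = policy(state, history)   # think: reason, pick an action
        execute(action)                            # act: apply it to the environment
        ok = verify(state, action, observe())      # verify: check the outcome
        history.append(Step(thought, action, ok))
    return history


class ToyEnv:
    """Hypothetical stand-in for a GUI: a button that must be clicked twice."""

    def __init__(self) -> None:
        self.clicks = 0

    def observe(self) -> str:
        return f"clicks={self.clicks}"

    def execute(self, action: str) -> None:
        if action == "click":
            self.clicks += 1


env = ToyEnv()
trace = run_episode(
    observe=env.observe,
    policy=lambda state, hist: ("button not yet pressed twice", "click"),
    execute=env.execute,
    # Toy verifier (an assumption): an action counts as verified if the state changed.
    verify=lambda before, action, after: before != after,
    is_done=lambda state: state == "clicks=2",
)
print(len(trace), all(step.verified for step in trace))
```

Passing the history back into the policy mirrors the role of historical context noted in the TL;DR. The recorded `Step` list is the kind of trajectory that the Figure 3 caption describes as being exported for further training.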