UI-Venus-1.5 Technical Report

Venus-Team: Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, Qian Li, Jinzhen Lin, Yuqi Zhou, Linchao Zhu, Liang Chen, Zhenyu Guo, Changhua Meng, Weiqiang Wang

TL;DR

UI-Venus-1.5 presents a unified end-to-end GUI agent trained with a four-stage pipeline—Mid-Training, Offline-RL, Online-RL, and Model Merge—achieving state-of-the-art results on multiple GUI grounding and navigation benchmarks and robust real-world navigation in Chinese apps. It introduces a 10B-token Mid-Training corpus across 30+ GUI datasets, a scalable online RL framework via Device-as-a-Service (DaaS), and a model merging strategy to fuse domain-specific grounding, web, and mobile capabilities into a single checkpoint. Empirical results demonstrate strong performance across ScreenSpot-Pro, VenusBench-GD, AndroidWorld, WebVoyager, and other benchmarks, with notable gains from model scaling and an ablation study highlighting the value of each training stage. Collectively, the work delivers a practical, end-to-end GUI assistant capable of autonomously executing complex tasks across diverse web and mobile environments, including 40+ Chinese apps, while maintaining deployment efficiency.

Abstract

GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application scenarios. Compared to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code: https://github.com/inclusionAI/UI-Venus; Model: https://huggingface.co/collections/inclusionAI/ui-venus
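
To make the Model Merging idea in advance (3) concrete, the sketch below fuses three domain-specific checkpoints by weighted parameter averaging. This is an illustrative assumption only: the abstract does not state which merging algorithm the report uses, and the function name `merge_checkpoints`, the toy parameter dictionaries, and the equal weights are all hypothetical.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of floats (hypothetical format).
StateDict = Dict[str, List[float]]


def merge_checkpoints(checkpoints: List[StateDict], weights: List[float]) -> StateDict:
    """Weighted average of same-named parameters (assumed merging rule)."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalize so the weights sum to 1

    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")

    merged: StateDict = {}
    for name in names:
        tensors = [ckpt[name] for ckpt in checkpoints]
        size = len(tensors[0])
        if any(len(t) != size for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum across the checkpoints.
        merged[name] = [
            sum(w * t[i] for w, t in zip(norm, tensors)) for i in range(size)
        ]
    return merged


# Toy stand-ins for the grounding, web, and mobile specialists.
grounding = {"layer.weight": [1.0, 2.0]}
web = {"layer.weight": [3.0, 4.0]}
mobile = {"layer.weight": [5.0, 6.0]}
unified = merge_checkpoints([grounding, web, mobile], [1.0, 1.0, 1.0])
```

Averaging of this kind only makes sense when the specialists share one architecture and a common initialization, which is consistent with all three being derived from the same Mid-Training stage.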

Paper Structure (38 sections, 8 equations, 8 figures, 16 tables)

Figures (8)

  • Figure 1: UI-Venus-1.5 achieves SOTA performance across multiple GUI grounding and navigation benchmarks. Note that in the three radar charts of grounding, we have normalized the results of the top-performing model to 100% to more clearly differentiate comparisons among various baselines.
  • Figure 2: System Overview of UI-Venus-1.5. It operates as an end-to-end GUI Agent that interprets user instructions, perceives interface states through screenshots, and executes interactive actions (e.g., clicking, typing, scrolling) to accomplish tasks across diverse executable environments.
  • Figure 3: The Four-Stage Pipeline of UI-Venus-1.5. Starting from Qwen3-VL Series, the model progresses through a multi-stage curriculum: (1) Mid-Training on large-scale GUI data for domain knowledge injection; (2) Offline-RL for task-specific optimization across grounding, mobile, and web objectives; (3) Online-RL to enhance navigation in complex, real-world settings; and (4) Model Merge, which unifies the specialized models into the final UI-Venus-1.5.
  • Figure 4: (a) The inner part represents the functional task categories (e.g., GUI-VQA, Grounding, Perception), while the outer one details the distribution of specific data sources and target platforms (Web, Desktop, Mobile); (b) Iterative data refinement pipeline with teacher scoring, trace rewriting/reconstruction, and manual verification.
  • Figure 5: Data generation loop via the DaaS environment. Iterating this pipeline raises the overall trace-generation success rate from 17.9% to over 70%.
  • ...and 3 more figures
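
The normalization mentioned in the Figure 1 caption (the top-performing model is rescaled to 100%) can be sketched as follows. This is a minimal illustration assuming the rescaling is applied independently per benchmark axis; the function name `normalize_to_top` and the example scores are hypothetical placeholders, not results from the report.

```python
from typing import Dict


def normalize_to_top(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one benchmark's scores so the best model maps to 100."""
    if not scores:
        raise ValueError("scores must not be empty")
    top = max(scores.values())
    if top <= 0:
        raise ValueError("top score must be positive")
    return {model: 100.0 * value / top for model, value in scores.items()}


# Placeholder scores for a single radar-chart axis (not real results).
axis_scores = {"model_a": 50.0, "model_b": 40.0}
normalized = normalize_to_top(axis_scores)
```

Because each axis is divided by its own maximum, the chart shows relative gaps between models rather than absolute accuracy, so values on different axes are not directly comparable.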