VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Beitong Zhou; Zhexiao Huang; Yuan Guo; Zhangxuan Gu; Tianyu Xia; Zichen Luo; Fei Tang; Dehan Kong; Yanyi Shang; Suling Ou; Zhenlin Guo; Changhua Meng; Shuheng Shen

TL;DR

VenusBench-GD presents the largest cross-platform GUI grounding benchmark to date, designed for web, mobile, and desktop evaluation in English and Chinese. It introduces a six-task hierarchical framework (three basic, three advanced) over 97 apps and 6,166 annotated image-instruction pairs across 13 UI element types. A bottom-up data pipeline combines raw data collection, detector-based element localization, and ML-generated instruction prompts with multi-stage quality filtering to ensure high annotation fidelity. Experiments reveal that general multimodal models now rival or surpass GUI-specialized models on basic grounding, while advanced tasks continue to favor GUI-specific approaches, underscoring the need for a multi-tiered evaluation framework.

Abstract

GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms and enables hierarchical evaluation for real-world applications. Our contributions are as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data; (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks; and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experiments yield two key findings. First, general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. Second, advanced tasks still favor GUI-specialized models, though these models exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
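
To make the idea of tiered grounding evaluation concrete, the following is a minimal sketch of how per-tier accuracy could be computed. It assumes the convention, common in GUI grounding evaluation, that a prediction counts as correct when the predicted click point falls inside the annotated bounding box; the abstract does not state the exact metric used by VenusBench-GD. The names `Sample`, `point_in_box`, and `accuracy_by_tier`, along with the toy data, are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One annotated target (hypothetical schema, not the benchmark's format)."""
    tier: str  # "basic" or "advanced"
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def point_in_box(point, box):
    """Return True if the (x, y) point lies inside the box, edges included."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def accuracy_by_tier(samples, predictions):
    """Compute point-in-box accuracy separately for each tier."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, point in zip(samples, predictions):
        totals[sample.tier] += 1
        hits[sample.tier] += point_in_box(point, sample.box)
    return {tier: hits[tier] / totals[tier] for tier in totals}


# Toy data: two basic samples and one advanced sample (made up for illustration).
samples = [
    Sample("basic", (10, 10, 50, 30)),
    Sample("basic", (100, 200, 180, 240)),
    Sample("advanced", (0, 0, 20, 20)),
]
predictions = [(30, 20), (90, 220), (5, 5)]  # predicted click points

scores = accuracy_by_tier(samples, predictions)
print(scores)
```

Reporting scores per tier rather than as a single pooled number is what allows a gap between basic and advanced performance, such as the one described in the abstract, to become visible.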
