VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen

TL;DR

VenusBench-GD presents the largest cross-platform GUI grounding benchmark to date, designed for web, mobile, and desktop evaluation in English and Chinese. It introduces a six-task hierarchical framework (three basic, three advanced) over 97 apps and 6,166 annotated image-instruction pairs across 13 UI element types. A bottom-up data pipeline combines raw data collection, detector-based element localization, and ML-generated instruction prompts with multi-stage quality filtering to ensure high annotation fidelity. Experiments reveal that general multimodal models now rival or surpass GUI-specialized models on basic grounding, while advanced tasks continue to favor GUI-specific approaches, underscoring the need for a multi-tiered evaluation framework.

Abstract

GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though these models exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
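As a rough illustration of what a multi-tiered evaluation might look like in practice, the sketch below aggregates grounding accuracy separately per tier and per subtask. It is a minimal sketch under stated assumptions, not the benchmark's actual scoring code: the `Sample` record, the subtask names, and the point-in-box hit criterion (a prediction counts as correct when the predicted click point falls inside the annotated bounding box, a convention common in GUI grounding evaluation) are all assumptions and are not taken from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One evaluated example. All field names are hypothetical."""
    task: str                               # placeholder subtask name
    tier: str                               # "basic" or "advanced"
    box: tuple[float, float, float, float]  # annotated box (x1, y1, x2, y2)
    pred: tuple[float, float]               # predicted click point (x, y)


def is_hit(sample: Sample) -> bool:
    """Assumed criterion: the predicted point lies inside the annotated box."""
    x, y = sample.pred
    x1, y1, x2, y2 = sample.box
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy_by(samples: list[Sample], key: str) -> dict[str, float]:
    """Group samples by the given attribute ("tier" or "task") and return accuracy per group."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for s in samples:
        group = getattr(s, key)
        totals[group] += 1
        hits[group] += int(is_hit(s))
    return {group: hits[group] / totals[group] for group in totals}


# Toy data with placeholder subtask names (not the benchmark's real ones).
samples = [
    Sample("basic_task_a", "basic", (0, 0, 10, 10), (5, 5)),        # inside the box
    Sample("basic_task_a", "basic", (0, 0, 10, 10), (20, 5)),       # outside the box
    Sample("advanced_task_a", "advanced", (10, 10, 30, 30), (15, 25)),  # inside the box
]

per_tier = accuracy_by(samples, "tier")
per_task = accuracy_by(samples, "task")
```

Reporting the two tiers separately, rather than as a single pooled score, is what lets a gap between basic and advanced grounding show up at all; a pooled average would hide it.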

Paper Structure

This paper contains 20 sections, 10 figures, 11 tables.

Figures (10)

  • Figure 1: The performance of representative GUI grounding models. Notably, model performance on advanced grounding tasks is significantly lower than on basic tasks, highlighting the increased difficulty and reasoning demands of the former.
  • Figure 2: The overview of the VenusBench-GD benchmark. VenusBench-GD integrates basic and advanced grounding tasks to comprehensively evaluate the capabilities of existing GUI models, as shown above. Basic tasks assess the ability to recognize local UI elements, while advanced tasks require holistic reasoning over the entire interface and its underlying application functionality, demanding a more complex and global understanding.
  • Figure 3: The domain distribution of our grounding benchmark. VenusBench-GD spans 97 distinct apps, software, and websites across desktop, mobile, and web platforms, ensuring diverse and comprehensive coverage. We consolidate representations of the same software across platforms into one entry for clarity.
  • Figure 4: Examples of inaccurate annotations in existing benchmarks.
  • Figure 5: Thinking-enabled model makes the correct grounding action with detailed analysis of the whole screenshot.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Remark 1
  • Remark 2