
TransBench: Breaking Barriers for Transferable Graphical User Interface Agents in Dynamic Digital Environments

Yuheng Lu, Qian Yu, Hongru Wang, Zeming Liu, Wei Su, Yanping Liu, Yuhang Guo, Maocheng Liang, Yunhong Wang, Haifeng Wang

TL;DR

This work introduces TransBench, the first benchmark specifically designed to evaluate and enhance the transferability of GUI grounding along three dimensions: cross-version, cross-platform, and cross-application. It builds a multi-platform, multi-version data pipeline covering 81 apps, 1,459 screenshots, and over 65,000 bounding boxes to support robust grounding evaluation, plus more than 22,000 high-quality, human-verified grounding instructions. Across diverse GUI models, Qwen2.5VL achieves the best grounding accuracy while UGround often yields the smallest localization distance, and fine-tuning on older versions markedly improves cross-version performance. The results reveal substantial transferability gaps, particularly on the Web platform, and demonstrate the practical potential of transferable GUI agents in dynamic real-world environments, while also acknowledging limitations in computational cost and data efficiency.

Abstract

Graphical User Interface (GUI) agents, which autonomously operate on digital interfaces through natural language instructions, hold transformative potential for accessibility, automation, and user experience. A critical aspect of their functionality is grounding: the ability to map linguistic intents to visual and structural interface elements. However, existing GUI agents often struggle to adapt to the dynamic and interconnected nature of real-world digital environments, where tasks frequently span multiple platforms and applications and are also affected by version updates. To address this, we introduce TransBench, the first benchmark designed to systematically evaluate and enhance the transferability of GUI agents across three key dimensions: cross-version transferability (adapting to version updates), cross-platform transferability (generalizing across platforms like iOS, Android, and Web), and cross-application transferability (handling tasks spanning functionally distinct apps). TransBench includes 15 app categories with diverse functionalities, capturing essential pages across versions and platforms to enable robust evaluation. Our experiments demonstrate significant improvements in grounding accuracy, showcasing the practical utility of GUI agents in dynamic, real-world environments. Our code and data will be publicly available on GitHub.
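
To make the two quantities reported in the TL;DR (grounding accuracy and localization distance) concrete, the following is a minimal sketch of how such metrics are commonly computed. It is not taken from the paper: it assumes a prediction is a single click point, that a prediction counts as correct when the point falls inside the annotated bounding box, and that distance is the Euclidean distance from the point to the box center. The names Box and grounding_metrics are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the border are treated as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def center(self) -> tuple[float, float]:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def grounding_metrics(predictions, targets):
    """Return (accuracy, mean distance) over paired click points and boxes.

    Assumed definitions, not the paper's:
      accuracy = fraction of predicted points that fall inside their target box
      distance = mean Euclidean distance from each point to its box center
    """
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")

    hits = 0
    total_distance = 0.0
    for (x, y), box in zip(predictions, targets):
        if box.contains(x, y):
            hits += 1
        cx, cy = box.center()
        total_distance += hypot(x - cx, y - cy)

    n = len(targets)
    return hits / n, total_distance / n


if __name__ == "__main__":
    # Two toy samples: one click inside its box, one outside.
    points = [(5.0, 5.0), (30.0, 30.0)]
    boxes = [Box(0, 0, 10, 10), Box(0, 0, 20, 20)]
    accuracy, distance = grounding_metrics(points, boxes)
    print(f"accuracy={accuracy:.2f}, mean distance={distance:.2f}")
```

Under these assumptions the two metrics can disagree, which is consistent with one model leading on accuracy while another leads on distance: a model can miss the box often but only by a small margin, or hit the box often while its misses land far away.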

Paper Structure

This paper contains 53 sections, 1 equation, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Illustration of the three aspects of transferability. Green denotes cross-version transferability: transferring the knowledge learned from the homepage of Jingdong (a Chinese shopping app) on Android version 12.0.0 to a newer Android version, 13.6.8. Red denotes cross-platform transferability: transferring from the Android version of Jingdong to its iOS version 13.8.1 and its Web version. Blue denotes cross-application transferability: transferring from Jingdong to other apps with the same functionality (e.g., Shopping: Pinduoduo) or with different functionality (e.g., Finance: Bank of China).
  • Figure 2: Illustration of the data collection process. The blue box represents our proposed benchmark, TransBench, which consists of three parts: ScreenShot Acquisition, Annotating Bounding Boxes, and Annotating Grounding Instructions. Platform means iOS, Android, and Web. Pages are manually grouped under page names according to human semantics, such as "Shopping cart," "My page," "Home," "Comments," and so on; pages sharing a name usually have similar functions.
  • Figure 3: Sub-figures (a) and (b) show the variation of average accuracy and average distance after fine-tuning Aria-ui on App Split. "CAT" means category.
  • Figure 4: An example of the Android version.
  • Figure 5: An example of the iOS version.
  • ...and 4 more figures