E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion

Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu

TL;DR

This work tackles the lack of high-quality Chinese GUI navigation data by introducing E-ANT, a large-scale dataset with ~40k real human trajectories across 20k+ tiny apps, enriched with intents, per-step screenshots, and detailed layout annotations. It systematically evaluates zero-shot and fine-tuned LLM/MLLM approaches and introduces data augmentation to improve page understanding and chained decision-making data. The results reveal significant challenges in GUI navigation, especially at the trajectory level, and demonstrate how layout-aware representations and targeted augmentation can boost per-image decisions. Overall, E-ANT provides a valuable benchmark and baseline methods to advance evaluation and development of GUI navigation and multimodal decision-making for mobile apps, particularly in Chinese contexts.

Abstract

Online GUI navigation on mobile devices has attracted considerable attention in recent years because it underpins many real-world applications. With the rapid development of large language models (LLMs), multimodal large language models (MLLMs) show tremendous potential for this task. However, existing MLLMs need high-quality data to improve their ability to make correct navigation decisions in response to human user inputs. In this paper, we develop a novel dataset, named E-ANT, the first Chinese GUI navigation dataset that contains real human behaviour and high-quality annotated screenshots, comprising nearly 40,000 real human traces over 5000+ different tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and report their experimental results with ablation studies. We believe that our proposed dataset will benefit both the evaluation and development of GUI navigation and LLM/MLLM decision-making capabilities.

Paper Structure (20 sections, 5 figures, 1 table)

This paper contains 20 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Left: Annotations of AitW. Middle: Our page analysis results on AitW data. Right: Our page analysis results on our dataset. Note: Our method accurately identifies the "Select Specifications" button, critical for order placements.
  • Figure 2: Sample Trajectory in Our Dataset: Navigating to a Merchant's Homepage in a Tiny-App (Red Dot Indicates Click Location in the Current State).
  • Figure 3: Two steps of sampling from a trajectory: Data overview and data production pipeline for each step in the dataset.
  • Figure 4: Key characteristics of our dataset. (a) Overview of the trajectory distribution: most trajectories are shorter than 10 steps, and longer trajectories tend to have lower success rates. (b) The dataset is collected from over 20,000 distinct tiny apps and URLs, with very low repetition and high variability. (c) Distribution of event types at individual steps, gathered from over 140,000 single-step operations. (d) All tiny apps can be classified into 25 merchant categories, ranging from release industry to tool, indicating diversity in operation.
  • Figure 5: Pipeline for executing data augmentation methods on E-ANT.