E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion
Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu
TL;DR
This work tackles the lack of high-quality Chinese GUI navigation data by introducing E-ANT, a large-scale dataset of nearly 40,000 real human trajectories across 5000+ tiny apps, enriched with intents, per-step screenshots, and detailed layout annotations. It systematically evaluates zero-shot and fine-tuned LLM/MLLM approaches and uses data augmentation to strengthen page understanding and enrich the chained decision-making data. The results reveal significant challenges in GUI navigation, especially at the trajectory level, and show that layout-aware representations and targeted augmentation can improve per-image decisions. Overall, E-ANT provides a benchmark and baseline methods for evaluating and developing GUI navigation and multimodal decision-making on mobile apps, particularly in Chinese contexts.
Abstract
Online GUI navigation on mobile devices has attracted considerable attention in recent years because it supports many real-world applications. With the rapid development of large language models (LLMs), multimodal large language models (MLLMs) show tremendous potential for this task. However, existing MLLMs need high-quality data to improve their ability to make correct navigation decisions from human user inputs. In this paper, we develop a novel and highly valuable dataset, named E-ANT, the first Chinese GUI navigation dataset that contains real human behaviour and high-quality annotated screenshots, comprising nearly 40,000 real human traces over 5000+ different tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and report their experimental results with ablation studies. We believe that our proposed dataset will be beneficial for both the evaluation and development of GUI navigation and LLM/MLLM decision-making capabilities.
