
Explorer: Robust Collection of Interactable GUI Elements

Iason Chaimalas, Arnas Vyšniauskas, Gabriel Brostow

TL;DR

Explorer addresses the challenge of collecting reliable, target-domain GUI data to train robust automation agents. It introduces a three-part system (an interactable detector, a screen-similarity model, and an action-matching component) paired with a trace-enabled workflow and voice-navigation capability, all designed to run across desktop websites and Android apps without platform-specific APIs. On the targeted GUIs, Explorer achieves strong per-GUI detection, efficient state-aware similarity, and cross-device trace replication, with evidence from multiple apps and platforms. The work illustrates practical gains in hands-free GUI traversal and open-sources its dataset to foster broader adoption and extension in real-world accessibility and automation tasks.

Abstract

Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data collection needed to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML model's precision on a specific app. We therefore take the perspective that a given user needs confidence that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e. interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones and for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and the subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.

Paper Structure

This paper contains 33 sections, 2 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Visualization of data collection and auto-labeling, with ground-truth bounding boxes in green. Personal data is covered (orange).
  • Figure 2: Visualization of data labeling for the Screen Similarity task, with same-state GUI screens labeled in the same group. Screenshots assigned to different groups (e.g. Group 2 and 4) are different GUI states. Hence, training labels ("same" or "different" state) for the Siamese network can be inferred from group membership.
  • Figure 3: Visual change induced after hovering over the "Teachers" interactable with a mouse, on the KhanAcademy home website. Similar changes occur for desktop-computer GUI elements, either on or sometimes around the interactable, e.g. tooltips.
  • Figure 4: Examples of two common occlusions in the Android Spotify app. The bbox of each occluded interactable, as retrieved from the Android Accessibility Tree, is illustrated with dashed yellow lines. Compare this to the truncated solid-yellow bbox above the bottom song-banner, which is the true size and shape of that interactable. Personal information is covered (orange).
  • Figure 5: Screen Similarity Model trained using a Siamese Network on pairs of our webpages labeled as "same" or "different" states. KhanSimilarity denotes our dataset of KhanAcademy screenshots for screen similarity.
  • ...and 12 more figures
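
The Figure 2 caption notes that "same" or "different" labels for the Siamese network can be inferred from group membership. A minimal sketch of that inference follows, assuming each screenshot has already been assigned a group id. The function name `pair_labels` and the file names are hypothetical illustrations, not taken from the Explorer code.

```python
from itertools import combinations

def pair_labels(groups):
    """Build labeled screenshot pairs from group membership.

    groups: dict mapping a screenshot name to its group id (hypothetical format).
    Returns a list of (screenshot_a, screenshot_b, label) tuples, where
    label is 1 if both screenshots share a group (same GUI state), else 0.
    """
    items = sorted(groups.items())  # sort for a deterministic pair order
    return [
        (a, b, int(group_a == group_b))
        for (a, group_a), (b, group_b) in combinations(items, 2)
    ]

# Hypothetical example: two screenshots of one state, plus two other states.
screens = {
    "home_1.png": 1,
    "home_2.png": 1,
    "courses_1.png": 2,
    "profile_1.png": 4,
}
pairs = pair_labels(screens)
for a, b, label in pairs:
    print(a, b, "same" if label else "different")
```

Enumerating all pairs grows quadratically with the number of screenshots and typically yields far more "different" than "same" pairs, so a real training pipeline would likely subsample or rebalance them. That choice is not specified in this section.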