TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents

Kunal Singh, Shreyas Singh, Mukund Khanna

TL;DR

TRISHUL addresses the challenge of generalizable GUI understanding for large vision-language models by proposing a training-free agentic framework. It introduces Hierarchical Screen Parsing (HSP) to create a multi-granular GUI representation and SEED to generate spatially and semantically enriched element descriptions, enabling robust action grounding and GUI referring. Across ScreenSpot, VisualWebBench, Mind2Web, AITW, and ScreenPR, TRISHUL consistently outperforms training-free baselines and rivals training-based methods, with ablations confirming the critical roles of GROIs and SEED. The approach enhances cross-domain generalization and offers practical benefits for accessibility and automated GUI interactions, supported by extensive human evaluations and multi-candidate grounding analyses.

Abstract

Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in a single GUI task rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work together to provide multi-granular, spatially and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.

Paper Structure

This paper contains 24 sections, 4 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Screen parsing results showing detected GUI elements and their function descriptors, produced by our HSP and SEED modules.
  • Figure 2: TRISHUL agentic action grounding framework. The pink arrow denotes our Hierarchical Screen Parsing (HSP) method, which generates GROIs and local element annotations; green arrows represent our Spatially Enhanced Element Descriptor (SEED) workflow; blue arrows represent our GROI proposal framework; and the magenta arrow shows the Set-of-Marks (SoM) based grounding workflow.
  • Figure 3: TRISHUL agentic GUI referring framework, showing the two lenses created by our HSP module for local and global context. Lens-1 contains the local element (blue) in the cropped GROI (red); Lens-2 contains the GROI (blue) in the full input screenshot (red). The selected point is shown as a black dot. Both lenses are fed to the LVLM to generate layout and task descriptions.
  • Figure 4: Human evaluation results on the ScreenPR benchmark. TRISHUL is preferred by human annotators 63% of the time over the ToL agent and 73% of the time over the baseline GPT-4o.
  • Figure 5: Local Element Exhaustiveness Score for ScreenSpot, VisualWebBench, AITW, and Mind2Web.
  • ...and 9 more figures
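
The two-lens construction described in the Figure 3 caption can be illustrated with a small geometric sketch. This is not the authors' implementation: the `Box` type, the `build_lenses` function, and the "smallest containing box" selection rule are all assumptions made for illustration, and the element and GROI boxes are taken as given (in TRISHUL they would come from HSP). The sketch only shows how a selected point could be mapped to a local lens (element inside its GROI crop) and a global lens (GROI inside the full screenshot).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    def area(self) -> int:
        return (self.right - self.left) * (self.bottom - self.top)

    def contains_point(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def contains_box(self, other: "Box") -> bool:
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)

    def relative_to(self, origin: "Box") -> "Box":
        """Express this box in the coordinate frame of a crop at `origin`."""
        return Box(self.left - origin.left, self.top - origin.top,
                   self.right - origin.left, self.bottom - origin.top)


def smallest(boxes: List[Box]) -> Optional[Box]:
    """Return the box with the smallest area, or None for an empty list."""
    return min(boxes, key=Box.area) if boxes else None


def build_lenses(
    point: Tuple[int, int],
    elements: List[Box],
    grois: List[Box],
    screen: Box,
) -> Optional[Tuple[Dict[str, Box], Dict[str, Box]]]:
    """Build (lens1, lens2) for a selected point.

    Assumed selection rule: pick the smallest element containing the point,
    then the smallest GROI containing that element. If no GROI contains the
    element, fall back to the whole screen as the region.
    """
    x, y = point
    element = smallest([e for e in elements if e.contains_point(x, y)])
    if element is None:
        return None  # the point does not fall on any detected element
    groi = smallest([g for g in grois if g.contains_box(element)]) or screen
    # Lens-1: crop to the GROI and highlight the element in crop coordinates.
    lens1 = {"frame": groi, "highlight": element.relative_to(groi)}
    # Lens-2: keep the full screenshot and highlight the GROI.
    lens2 = {"frame": screen, "highlight": groi}
    return lens1, lens2
```

In a full pipeline, each lens would be rendered as an image (the frame cropped from the screenshot, with the highlight box drawn on it) before being passed to the LVLM; that rendering step is omitted here.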