TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents
Kunal Singh, Shreyas Singh, Mukund Khanna
TL;DR
TRISHUL addresses the challenge of generalizable GUI understanding for large vision-language models by proposing a training-free agentic framework. It introduces Hierarchical Screen Parsing (HSP) to create a multi-granular GUI representation and the Spatially Enhanced Element Description (SEED) module to generate spatially and semantically enriched element descriptions, enabling robust action grounding and GUI referring. Across ScreenSpot, VisualWebBench, Mind2Web, AITW, and ScreenPR, TRISHUL consistently outperforms training-free baselines and rivals training-based methods, with ablations confirming the critical roles of GROIs and SEED. The approach improves cross-domain generalization and offers practical benefits for accessibility and automated GUI interaction, supported by extensive human evaluations and multi-candidate grounding analyses.
Abstract
Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in a single GUI task rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL seamlessly integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work synergistically to provide multi-granular, spatially and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
