MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

SeokJoo Kwak; Jihoon Kim; Boyoun Kim; Jung Jae Yoon; Wooseok Jang; Jeonghoon Hong; Jaeho Yang; Yeong-Dae Kwon

TL;DR

GUI grounding maps natural language commands to screen coordinates, a task made hard by large display scales and semantic ambiguity. MEGA-GUI addresses this with a modular, multi-stage framework that decouples coarse ROI search from fine-grained grounding, guided by a bidirectional ROI zoom and a context-aware instruction rewriting module. The approach achieves state-of-the-art accuracy on the ScreenSpot-Pro and OSWorld-G benchmarks, at 73.18% and 68.63% respectively, and is supported by targeted ablations and a publicly released Grounding Benchmark Toolkit. By letting specialized agents solve separate subtasks, MEGA-GUI offers robust performance gains, open research tooling, and a path toward safer, more accessible GUI automation systems.

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to screen coordinates, is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
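The coarse-to-fine split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not reproduce the bidirectional ROI zoom or the rewriting agent; it only shows how a point predicted inside a cropped ROI maps back to full-screen coordinates. The names `Box`, `clamp_box`, `ground`, `select_roi`, and `ground_in_roi` are hypothetical, and the two stage functions stand in for whatever vision-language models would be used.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in screen pixels; right/bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def clamp_box(box: Box, screen_w: int, screen_h: int) -> Box:
    """Clip a proposed ROI to the screen and keep it at least 1x1 pixel."""
    left = max(0, min(box.left, screen_w - 1))
    top = max(0, min(box.top, screen_h - 1))
    right = max(left + 1, min(box.right, screen_w))
    bottom = max(top + 1, min(box.bottom, screen_h))
    return Box(left, top, right, bottom)


def ground(
    instruction: str,
    screen_size: Tuple[int, int],
    select_roi: Callable[[str, Tuple[int, int]], Box],   # hypothetical stage 1
    ground_in_roi: Callable[[str, Box], Point],          # hypothetical stage 2
) -> Point:
    """Two-stage grounding: coarse ROI selection, then a point inside the ROI."""
    screen_w, screen_h = screen_size
    roi = clamp_box(select_roi(instruction, screen_size), screen_w, screen_h)

    # Stage 2 answers in ROI-local coordinates (origin at the ROI's top-left).
    local_x, local_y = ground_in_roi(instruction, roi)
    local_x = min(max(local_x, 0), roi.width - 1)
    local_y = min(max(local_y, 0), roi.height - 1)

    # Translate back to full-screen coordinates.
    return roi.left + local_x, roi.top + local_y
```

With stub stages that return the ROI `Box(1000, 500, 1400, 800)` and the local point `(30, 40)` on a 3840x2160 screen, `ground` returns `(1030, 540)`. Keeping the stages as plain callables reflects the modular idea in the abstract: each stage can be swapped for a different model without touching the coordinate bookkeeping.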
