MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang, Jeonghoon Hong, Jaeho Yang, Yeong-Dae Kwon

TL;DR

GUI grounding maps natural language commands to screen coordinates, a task made hard by large, visually dense displays and semantic ambiguity. MEGA-GUI addresses this with a modular, multi-stage framework that decouples coarse ROI search from fine-grained grounding, guided by a bidirectional ROI zoom and a context-aware instruction rewriting module. The approach achieves state-of-the-art results on the ScreenSpot-Pro and OSWorld-G benchmarks, with 73.18% and 68.63% accuracy, respectively, and is supported by targeted ablations and a publicly released Grounding Benchmark Toolkit. By assigning subtasks to specialized agents, MEGA-GUI offers robust performance gains, open research tooling, and a path toward safer, more accessible GUI automation systems.

Abstract

Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
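The abstract does not spell out the bidirectional ROI zoom algorithm, so the following is only a minimal sketch of one possible reading: a coarse-to-fine loop that shrinks the ROI around a model's guess and widens it again when the model judges that the target is no longer inside. The function names (`resize_around`, `bidirectional_zoom`), the `locate` callback, the `shrink`/`grow` factors, and the scripted `fake_locate` stub are all hypothetical and are not taken from the paper or its code release.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)
Point = Tuple[float, float]


def resize_around(center: Point, width: float, height: float, screen: Box) -> Box:
    """Return a box of the requested size centered on `center`, shifted to stay on screen."""
    sl, st, sr, sb = screen
    width = min(width, sr - sl)
    height = min(height, sb - st)
    left = min(max(center[0] - width / 2, sl), sr - width)
    top = min(max(center[1] - height / 2, st), sb - height)
    return (left, top, left + width, top + height)


def bidirectional_zoom(
    screen: Box,
    locate: Callable[[Box], Optional[Point]],
    min_width: float = 960.0,
    shrink: float = 0.5,   # assumed zoom-in factor
    grow: float = 2.0,     # assumed zoom-out factor
    max_steps: int = 10,
) -> Box:
    """Hypothetical zoom loop: zoom in on a guess, zoom out when the target is reported missing."""
    roi = screen
    for _ in range(max_steps):
        guess = locate(roi)  # point in screen coordinates, or None if target not seen
        width, height = roi[2] - roi[0], roi[3] - roi[1]
        if guess is None:
            if roi == screen:
                break  # already at full view; nothing left to zoom out to
            center = ((roi[0] + roi[2]) / 2, (roi[1] + roi[3]) / 2)
            roi = resize_around(center, width * grow, height * grow, screen)
        elif width <= min_width:
            break  # ROI is small enough to hand to a fine-grained grounding stage
        else:
            roi = resize_around(guess, width * shrink, height * shrink, screen)
    return roi


# Scripted stand-in for a vision-language model: wrong on its first call,
# correct afterwards, and None whenever the target lies outside the ROI.
TARGET: Point = (3000.0, 500.0)
calls = []


def fake_locate(roi: Box) -> Optional[Point]:
    calls.append(roi)
    if not (roi[0] <= TARGET[0] <= roi[2] and roi[1] <= TARGET[1] <= roi[3]):
        return None
    return (1000.0, 500.0) if len(calls) == 1 else TARGET


final_roi = bidirectional_zoom((0.0, 0.0, 3840.0, 2160.0), fake_locate)
print(final_roi)
```

In this toy run the first guess is wrong, the first crop loses the target, the loop backs off to the full screen, and later steps zoom in again. A purely one-directional crop would have stayed locked onto the wrong region after the first error.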

Paper Structure

This paper contains 66 sections, 6 figures, 11 tables, 2 algorithms.

Figures (6)

  • Figure 1: The MEGA-GUI Framework. The GUI grounding task is decomposed into three independent stages, each with its own objective. This design enables the modular composition of specialized agents and facilitates systematic evaluation using our Grounding Benchmark Toolkit.
  • Figure 2: Comparison of single-shot method performance (left) against that of our bidirectional ROI zooming method (right) on the ScreenSpot-Pro benchmark using various VLMs. Our method provides a substantial improvement in ROI-containment rate across all evaluated VLMs. The gains are most pronounced at smaller ROI sizes, where the adaptive zoom enables recovery from initial localization errors.
  • Figure 3: Conditional Grounding Accuracy of various VLMs on the ScreenSpot-Pro benchmark. Accuracy is calculated only on ROIs that successfully contain the target element, isolating Stage 2 performance. The overall end-to-end accuracy is represented by the composite score (see Appendix).
  • Figure 4: Containment-ROI curves for ScreenSpot-Pro (SSP, top) and OSWorld-G (OSG, bottom), contrasting static one-shot cropping (left) with bidirectional zoom (right).
  • Figure 5: Grounding accuracy vs. ROI size for SSP (top) and OSG (bottom). Accuracy generally decreases as ROIs get larger and more cluttered, creating a trade-off with Stage 1's containment rate.
  • ...and 1 more figure
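The captions above refer to an ROI-containment rate (Stage 1) and a conditional grounding accuracy measured only on ROIs that contain the target (Stage 2). As a minimal sketch of how such quantities could be computed, the code below assumes that "containment" means the ROI fully encloses the target bounding box and that a prediction is a single click point. The `Sample` type and all function names are hypothetical, and the paper's composite score is defined in its appendix, not here.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)
Point = Tuple[float, float]


@dataclass
class Sample:
    target: Box   # ground-truth element bounding box
    roi: Box      # region selected by the coarse stage
    click: Point  # point predicted by the fine-grained stage, in screen coordinates


def box_in_box(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely within `outer` (assumed containment criterion)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def point_in_box(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def containment_rate(samples: List[Sample]) -> float:
    """Fraction of samples whose ROI contains the target."""
    return sum(box_in_box(s.target, s.roi) for s in samples) / len(samples)


def conditional_accuracy(samples: List[Sample]) -> float:
    """Click accuracy restricted to samples whose ROI contains the target."""
    contained = [s for s in samples if box_in_box(s.target, s.roi)]
    if not contained:
        return 0.0
    return sum(point_in_box(s.click, s.target) for s in contained) / len(contained)


def end_to_end_accuracy(samples: List[Sample]) -> float:
    """A sample counts only if the ROI contains the target and the click hits it."""
    hits = sum(box_in_box(s.target, s.roi) and point_in_box(s.click, s.target)
               for s in samples)
    return hits / len(samples)


target = (100.0, 100.0, 140.0, 120.0)
samples = [
    Sample(target, (0.0, 0.0, 500.0, 500.0), (120.0, 110.0)),    # contained, hit
    Sample(target, (0.0, 0.0, 500.0, 500.0), (300.0, 300.0)),    # contained, miss
    Sample(target, (50.0, 50.0, 400.0, 400.0), (101.0, 119.0)),  # contained, hit
    Sample(target, (130.0, 0.0, 500.0, 500.0), (135.0, 110.0)),  # ROI cuts off the target
]

print(containment_rate(samples), conditional_accuracy(samples), end_to_end_accuracy(samples))
```

Under these assumptions the end-to-end accuracy equals the containment rate multiplied by the conditional accuracy, which is one way to see the trade-off between ROI size and grounding accuracy that the captions describe.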