
Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems

Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang

TL;DR

This paper tackles GUI grounding under complex interface layouts by introducing Focus, a dual-system framework that blends fast, intuitive predictions with slow, deliberate analysis. It decomposes grounding into interface summarization, visual focused analysis, and precise coordinate prediction, and uses an adaptive switching mechanism to balance efficiency and accuracy based on task complexity. A 300K-sample data synthesis pipeline and token-based training enable robust generalization, with a 2B-parameter model achieving state-of-the-art results on ScreenSpot and ScreenSpot-Pro benchmarks. The work demonstrates that combining rapid perception with structured, task-driven reasoning substantially improves element localization in challenging GUIs, offering practical benefits for GUI automation and intelligent agents.

Abstract

Humans can flexibly switch between different modes of thinking based on task complexity, from rapid intuitive judgments to in-depth analytical understanding. However, current Graphical User Interface (GUI) grounding systems, which locate interface elements based on natural language instructions, rely solely on immediate prediction without reasoning. As a result, they struggle to understand complex interface layouts with nested structures and hierarchical relationships. Inspired by human dual-system cognition, we present Focus, a novel GUI grounding framework that combines fast prediction with systematic analysis. The framework dynamically switches between rapid and deliberate processing through adaptive system switching based on task complexity, optimizing both efficiency and accuracy. Focus decomposes grounding into progressive stages: interface summarization, visual focused analysis, and precise coordinate prediction. This structured decomposition enables systematic understanding of both interface layouts and visual relationships. Extensive experiments show that Focus achieves state-of-the-art performance with a 2B-parameter model trained on only 300K samples. Focus performs particularly well in complex GUI scenarios, achieving 77.4% average accuracy on ScreenSpot and 13.3% on the more challenging ScreenSpot-Pro. Our analysis reveals the effectiveness of this dual-system approach and its potential for improving complex GUI interaction.
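The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`ground`, `estimate_complexity`, `fast_predict`, `summarize`, `analyze`, `locate`), the `GroundingResult` type, and the fixed `threshold` are all hypothetical, and the plain threshold test stands in for the paper's adaptive switching criterion, which this summary does not specify. The stub functions at the bottom are toy placeholders for model calls.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

Point = Tuple[float, float]  # normalized (x, y) click position


@dataclass
class GroundingResult:
    point: Point
    system: str                   # "fast" or "slow"
    trace: Tuple[str, ...] = ()   # intermediate text from the slow stages


def ground(
    screenshot: Any,
    instruction: str,
    *,
    estimate_complexity: Callable[[Any, str], float],
    fast_predict: Callable[[Any, str], Point],
    summarize: Callable[[Any], str],
    analyze: Callable[[Any, str, str], str],
    locate: Callable[[Any, str, str, str], Point],
    threshold: float = 0.5,       # hypothetical cut-off, not from the paper
) -> GroundingResult:
    """Route a grounding request to a fast or a slow path (illustrative only)."""
    # Fast system: predict coordinates directly, with no explicit reasoning.
    if estimate_complexity(screenshot, instruction) < threshold:
        return GroundingResult(fast_predict(screenshot, instruction), "fast")

    # Slow system: the three progressive stages named in the abstract.
    summary = summarize(screenshot)                          # interface summarization
    analysis = analyze(screenshot, instruction, summary)     # visual focused analysis
    point = locate(screenshot, instruction, summary, analysis)  # coordinate prediction
    return GroundingResult(point, "slow", (summary, analysis))


# Toy stand-ins for model calls (assumptions for demonstration only).
def _complexity(screenshot: Any, instruction: str) -> float:
    return min(len(instruction.split()) / 10.0, 1.0)  # longer instruction = "harder"


STUBS = dict(
    estimate_complexity=_complexity,
    fast_predict=lambda s, i: (0.5, 0.5),
    summarize=lambda s: "editor window with a menu bar",
    analyze=lambda s, i, summ: "target is inside the File menu",
    locate=lambda s, i, summ, ana: (0.12, 0.04),
)

easy = ground(None, "click OK", **STUBS)
hard = ground(None, "open the export dialog nested under the file menu", **STUBS)
print(easy.system, hard.system)
```

Passing the stages in as callables keeps the routing logic separate from whatever model produces each stage, so the same skeleton works whether the stages are separate prompts to one model or distinct components.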

Paper Structure

This paper contains 25 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison of GUI grounding approaches. (a) Fast grounding systems (e.g., SeeClick (Cheng et al., 2024), ShowUI (Lin et al., 2024)) directly predict target locations without explicit reasoning. (b) Our Focus framework introduces a dual-system approach combining fast grounding with deliberate analysis, dynamically switching between systems based on task complexity.
  • Figure 2: Overview of Focus data construction and training process: the system performs interface summarization and task-oriented visual focused analysis for grounding. The middle shows complete examples of fast and slow grounding data. Focus dynamically switches between fast and slow grounding systems, with a complete example of slow-system grounding shown at the bottom.
  • Figure 3: Case study comparison between ShowUI and Focus. ShowUI struggles with complex GUI scenarios, while Focus handles them through its dual-system design.
  • Figure 4: Impact of the scaling factor $\alpha$ on Focus's performance, where "w/o $\alpha$" denotes the baseline without Adaptive System Switching. At $\alpha = 0.6$, Focus improves accuracy by 2.6% and reduces processing time by 0.6s relative to the baseline, demonstrating an effective balance between accuracy and efficiency.
  • Figure 5: Distribution of fast and slow system activation across different element types in ScreenSpot.
  • ...and 2 more figures