Ponder & Press: Advancing Visual GUI Agent towards General Computer Control

Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang

TL;DR

Ponder & Press introduces a two-stage, vision-only GUI agent that separates task understanding from element localization: an Instruction Interpreter (a generalist MLLM) decomposes high-level goals into actionable steps, and a Visual Element Locator (a GUI-grounding model) maps element descriptions to precise pixel coordinates. This divide-and-conquer approach yields state-of-the-art results on multiple GUI benchmarks, including a +22.5% improvement on ScreenSpot and strong performance on offline and interactive web, desktop, and mobile tasks. By training on a small labeled GUI-grounding dataset and using visual input only, the framework generalizes across diverse software without HTML or accessibility data, supporting human-like, flexible automation. The work highlights the potential of vision-based GUI agents for general computer control and provides a practical blueprint for scalable, cross-platform automation.

Abstract

Most existing GUI agents depend on non-vision inputs such as HTML source code or accessibility trees, which limits their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle to localize GUI elements accurately -- a critical requirement for effective GUI automation -- because of the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines a general-purpose MLLM as an 'interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely locates GUI elements for action placement. By relying on purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. The Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments -- including web pages, desktop software, and mobile UIs -- demonstrate that the Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents. Project homepage: https://invinciblewyq.github.io/ponder-press-page/
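
The interpreter/locator split described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `ActionStep`, `GroundedAction`, `ponder_and_press`, `stub_interpreter`, and `stub_locator` are hypothetical, the two stages are replaced by hard-coded stubs instead of real MLLM calls, and planning several steps from a single screenshot is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ActionStep:
    """One step produced by the interpreter (hypothetical schema)."""
    action: str     # e.g. "click" or "type"
    element: str    # natural-language description of the target element
    text: str = ""  # text to enter, if the action needs it


@dataclass
class GroundedAction:
    """An action step with its target resolved to pixel coordinates."""
    action: str
    point: Tuple[int, int]
    text: str = ""


# Both stages see only the screenshot (raw image bytes here), never HTML.
Interpreter = Callable[[str, bytes], List[ActionStep]]
Locator = Callable[[str, bytes], Tuple[int, int]]


def ponder_and_press(instruction: str, screenshot: bytes,
                     interpret: Interpreter,
                     locate: Locator) -> List[GroundedAction]:
    """Stage 1 turns the instruction into steps; stage 2 grounds each step."""
    steps = interpret(instruction, screenshot)
    return [GroundedAction(s.action, locate(s.element, screenshot), s.text)
            for s in steps]


def stub_interpreter(instruction: str, screenshot: bytes) -> List[ActionStep]:
    # Stand-in for a general-purpose MLLM; the returned plan is hard-coded.
    return [ActionStep("click", "search box"),
            ActionStep("type", "search box", "weather")]


def stub_locator(element: str, screenshot: bytes) -> Tuple[int, int]:
    # Stand-in for a GUI-grounding MLLM; the coordinates are made up.
    known_elements = {"search box": (640, 120)}
    return known_elements[element]


actions = ponder_and_press("Search for weather", b"",
                           stub_interpreter, stub_locator)
for grounded in actions:
    print(grounded)
```

Keeping the two stages behind plain callables mirrors the divide-and-conquer idea: either stage could be swapped for a different model without touching the other.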

Paper Structure

This paper contains 26 sections, 8 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Different types of frameworks for vision-based GUI agents.
  • Figure 2: Ponder & Press improves vision-based GUI agents on a broad range of tasks.
  • Figure 3: The framework of the Ponder & Press agent. The framework consists of two core components: an Instruction Interpreter that translates high-level user instructions into actionable steps, and a Visual Element Locator that localizes GUI elements for interactions such as clicking or typing. Our method ensures that complex instructions can be decomposed and precisely executed within diverse GUIs.
  • Figure 4: Case study of Ponder & Press on an office task.
  • Figure 5: A possible end-to-end visual GUI agent framework.
  • ...and 4 more figures