ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoshchuk, Artur Janicki

TL;DR

This paper introduces ClickAgent, a novel framework for building autonomous agents that addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements.

Abstract

With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent outperforms other prompt-based autonomous agents (CogAgent, AppAgent) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.

Paper Structure

This paper contains 10 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: In ClickAgent, the MLLM is responsible for reasoning, reflection and action planning. In this example, the MLLM generates a UI command, and a specialized UI location model identifies the coordinates of the corresponding icon on the screen.
  • Figure 2: ClickAgent performance on AITW General and AITW WebShopping using different MLLMs. In all cases, the TinyClick model is employed for UI location.
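
The division of labor described in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the `Action` type, the `Planner` and `Locator` interfaces, the action vocabulary, and the normalized coordinate convention are all assumptions made for illustration, and the two models are replaced by stubs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]  # assumed convention: (x, y) normalized to [0, 1]


@dataclass
class Action:
    kind: str          # hypothetical action vocabulary, e.g. "click" or "done"
    command: str = ""  # natural-language UI command produced by the planner


# Hypothetical interfaces standing in for the two models:
# the planner plays the role of the MLLM (reasoning and action planning),
# the locator plays the role of the UI location model (e.g., SeeClick).
Planner = Callable[[str, bytes], Action]
Locator = Callable[[bytes, str], Point]


def agent_step(task: str, screenshot: bytes,
               plan: Planner, locate: Locator) -> Tuple[Action, Optional[Point]]:
    """Run one step: the planner decides what to do, the locator decides where."""
    action = plan(task, screenshot)
    if action.kind != "click":
        # Non-click actions need no on-screen target in this sketch.
        return action, None
    return action, locate(screenshot, action.command)


# Stub models so the sketch runs without any real MLLM or location model.
def stub_planner(task: str, screenshot: bytes) -> Action:
    return Action(kind="click", command="click the settings icon")


def stub_locator(screenshot: bytes, command: str) -> Point:
    return (0.5, 0.25)


if __name__ == "__main__":
    action, point = agent_step("open settings", b"", stub_planner, stub_locator)
    print(action.command, point)
```

Keeping the planner and locator behind separate interfaces mirrors the idea that either component could be swapped independently, for example to compare different MLLMs while holding the UI location model fixed.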