TinyClick: Single-Turn Agent for Empowering GUI Automation

Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, Jakub Hoscilowicz

TL;DR

TinyClick is a compact single-turn UI agent built on Florence-2 Base (0.27B parameters) that localizes UI elements from natural language commands with about 250 ms latency and a training budget of about 56 GPU-hours. Through multitask vision-language training and MLLM-based data augmentation, it reaches near state-of-the-art accuracy on the Screenspot (73.8%) and OmniAct (58.3%) benchmarks while remaining orders of magnitude smaller than competing models. The approach shows that extensive visual pretraining and diverse multitask objectives enable effective GUI grounding with limited compute, supporting more sustainable and accessible GUI agent research. The work also presents ablations and failure analyses that highlight the importance of multitask data and annotation strategies for grounding performance. Overall, TinyClick provides a practical baseline for on-device UI agents and motivates future multi-turn extensions and broader use of cheap MLLM augmentation in GUI tasks.

Abstract

We present a single-turn agent for user interface (UI) interaction tasks, built on the vision-language model Florence-2-Base. The agent's primary task is to identify the screen coordinates of the UI element that corresponds to the user's command. It demonstrates very strong performance on Screenspot and OmniAct annotations while maintaining a very small size of 0.27B parameters and minimal latency. Moreover, training requires a small compute budget of 56 GPU-hours (worth about 40 USD). Notable improvement comes from vision-specific multi-task training and MLLM-based data augmentation. We hope that the reduced need for expensive compute resources and manually annotated data will facilitate more inclusive and sustainable research on UI agents.

Paper Structure

This paper contains 13 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Example command for the downstream task. TinyClick receives a screenshot and a user command and predicts the bounding box of the UI element. Stickers from Flaticon.
  • Figure 2: During training, the model receives a question and generates an answer. Both the question and the answer can contain location tokens of a specific UI element. Here, the first question asks for an element description and the second is a command to click a specific item. Multiple different tasks can be associated with a single UI element, allowing the model to gain a better understanding of the UI. Stickers from Flaticon.
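
The location tokens mentioned in the captions can be illustrated with a small sketch. It assumes a Florence-2-style encoding in which each box coordinate is quantized into one of 1000 bins and written as `<loc_N>`; the bin count, the helper names (`Box`, `quantize`, `box_to_tokens`, `tokens_to_box`, `click_point`), and the choice to click the box centre are illustrative assumptions, not details taken from the paper.

```python
import re
from dataclasses import dataclass

NUM_BINS = 1000  # assumption: Florence-2-style quantization into 1000 bins


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


def quantize(value: float, size: float, bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate to a bin index in [0, bins - 1]."""
    idx = int(value * bins / size)
    return min(max(idx, 0), bins - 1)


def box_to_tokens(box: Box, width: int, height: int) -> str:
    """Encode a pixel box as four location tokens (x1, y1, x2, y2)."""
    pairs = [(box.x1, width), (box.y1, height), (box.x2, width), (box.y2, height)]
    return "".join(f"<loc_{quantize(v, s)}>" for v, s in pairs)


def tokens_to_box(text: str, width: int, height: int, bins: int = NUM_BINS) -> Box:
    """Decode the first four location tokens in `text` back to pixels.

    Each bin index is mapped to the centre of its bin.
    """
    ids = [int(m) for m in re.findall(r"<loc_(\d+)>", text)]
    if len(ids) < 4:
        raise ValueError("expected at least four location tokens")
    x1, y1, x2, y2 = ids[:4]
    return Box(
        (x1 + 0.5) / bins * width,
        (y1 + 0.5) / bins * height,
        (x2 + 0.5) / bins * width,
        (y2 + 0.5) / bins * height,
    )


def click_point(box: Box) -> tuple[float, float]:
    """Hypothetical click target: the centre of the predicted box."""
    return ((box.x1 + box.x2) / 2, (box.y1 + box.y2) / 2)


if __name__ == "__main__":
    answer = box_to_tokens(Box(100, 200, 300, 400), width=1000, height=1000)
    print(answer)
    print(click_point(tokens_to_box(answer, width=1000, height=1000)))
```

Because the tokens are resolution-independent bin indices, the same string can be decoded against any screenshot size, at the cost of a rounding error of at most one bin width per coordinate.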