Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, Ram Ramrakhya, Chao Jia, Jeffrey Nichols, Alexander Toshev, Yinfei Yang, Zhe Gan

TL;DR

Ferret-UI Lite tackles the challenge of building autonomous GUI agents with small, on-device models. It combines a 3B multimodal LLM with a two-stage training regime (supervised fine-tuning followed by reinforcement learning with verifiable rewards), trains on diverse real and synthetic GUI data, and adds a zoom-in visual tool-use strategy. The approach achieves strong GUI grounding across multiple benchmarks and competitive navigation performance for a compact model, while revealing that multi-step navigation still lags behind larger models. Key findings show that balanced data mixtures and synthetic data improve both grounding and navigation, and that verifiable rewards and chain-of-thought (CoT) data contribute meaningfully. The work offers actionable lessons for deploying privacy-preserving, low-latency GUI agents on resource-constrained devices.
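The summary names two mechanisms, a verifiable reward and a zoom-in step, without giving their details. As a rough illustration only, the sketch below assumes a common form of each: a binary reward that checks whether a predicted click falls inside the target element's bounding box, and a zoom-in that crops a fixed-size window around a first-pass prediction and maps a refined point back to screen coordinates. The names `Box`, `grounding_reward`, `zoom_window`, and `crop_to_screen` are hypothetical and are not taken from the paper, whose actual reward design and cropping policy may differ.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in screen pixels: left, top, right, bottom."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_reward(pred_xy: Point, target: Box) -> float:
    """Assumed reward: 1.0 if the predicted click lands inside the target box, else 0.0."""
    x, y = pred_xy
    return 1.0 if target.contains(x, y) else 0.0


def zoom_window(pred_xy: Point, screen_w: float, screen_h: float,
                crop_w: float, crop_h: float) -> Box:
    """Crop window centered on the first-pass prediction, clamped to the screen.

    Assumes the crop is no larger than the screen; otherwise it starts at 0.
    """
    x, y = pred_xy
    left = max(0.0, min(x - crop_w / 2, screen_w - crop_w))
    top = max(0.0, min(y - crop_h / 2, screen_h - crop_h))
    return Box(left, top, left + crop_w, top + crop_h)


def crop_to_screen(crop_xy: Point, window: Box) -> Point:
    """Map a point given in crop pixels (no rescaling assumed) back to screen pixels."""
    cx, cy = crop_xy
    return (window.left + cx, window.top + cy)


if __name__ == "__main__":
    # First pass predicts roughly where the element is on a 1920x1080 screen.
    window = zoom_window((1000, 500), 1920, 1080, crop_w=400, crop_h=300)
    # Second pass predicts a refined point inside the crop.
    refined = crop_to_screen((210, 160), window)
    print(window, refined, grounding_reward(refined, Box(1000, 500, 1020, 520)))
```

Because the reward depends only on the prediction and the annotated box, it can be computed automatically for every rollout, which is what makes it "verifiable" in this sketch. If the crop is upscaled before the second pass, the mapping back would also need to divide by that scale factor.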

Abstract

Developing autonomous agents that effectively interact with Graphical User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Using techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent by curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and applying reinforcement learning with designed rewards. Ferret-UI Lite performs competitively with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of $91.6\%$, $53.3\%$, and $61.2\%$ on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of $28.0\%$ on AndroidWorld and $19.8\%$ on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.

Paper Structure

This paper contains 19 sections, 2 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Comparing Ferret-UI Lite with other end-to-end GUI agents. Our model achieves strong results on GUI grounding tasks, surpassing many larger models. However, its performance on multi-step navigation remains limited, underscoring the inherent challenges of developing lightweight, on-device agents capable of robust long-horizon reasoning.
  • Figure 2: An illustration of Ferret-UI Lite on a multi-step GUI navigation task. Human users prompt with a high-level goal in plain text, and the model autonomously interacts with GUI devices through tapping, scrolling, typing, etc., until the task is complete. At each step, the model observes the GUI screen, generates think-plan-act traces, and executes the action.
  • Figure 3: Model architecture and training recipes of Ferret-UI Lite. The model takes a GUI screen and the user instruction as inputs, and predicts chain-of-thought reasoning traces and a low-level action policy to control GUI devices in an end-to-end manner directly. The model is trained through supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR).
  • Figure 4: Synthetic navigation data generation pipeline, which consists of offline data generation based on human-annotated trajectories, and online rollouts collection from a multi-agent system.
  • Figure 4: SFT ablations on the AndroidWorld (AW) benchmark. Success rates (%) are averaged over five runs. The baseline model is built using only human-annotated episodes, without CoT and synthetic data.
  • ...and 8 more figures