Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Zhen Yang; Zi-Yi Dou; Di Feng; Forrest Huang; Anh Nguyen; Keen You; Omar Attia; Yuhao Yang; Michael Feng; Haotian Zhang; Ram Ramrakhya; Chao Jia; Jeffrey Nichols; Alexander Toshev; Yinfei Yang; Zhe Gan

TL;DR

Ferret-UI Lite tackles the challenge of building autonomous GUI agents with small, on-device models. It combines a 3B multimodal LLM with a two-stage training regime (supervised fine-tuning followed by reinforcement learning with verifiable rewards), trains on a diverse mixture of real and synthetic GUI data, and adds a zoom-in visual tool-use strategy. The approach achieves strong GUI grounding across multiple benchmarks and competitive navigation performance for a compact model, while revealing limitations in multi-step, on-device navigation relative to larger models. Key findings show that balanced data mixtures and synthetic data improve both grounding and navigation, and that verifiable rewards and chain-of-thought (CoT) data contribute meaningfully. The work offers practical lessons for deploying privacy-preserving, low-latency GUI agents on resource-constrained devices.

Abstract

Developing autonomous agents that effectively interact with Graphical User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Using techniques suited to small models, we build our 3B Ferret-UI Lite agent by curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and applying reinforcement learning with designed rewards. Ferret-UI Lite performs competitively with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of $91.6\%$, $53.3\%$, and $61.2\%$ on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of $28.0\%$ on AndroidWorld and $19.8\%$ on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.
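
To make two of the ideas named above more concrete, here is a minimal Python sketch of a zoom-in step and a verifiable grounding reward. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the crop rule or the reward design, so the function names (`zoom_in_crop`, `point_in_box_reward`), the `crop_frac` parameter, and the binary point-in-box reward are all hypothetical choices.

```python
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def zoom_in_crop(center: Point, image_size: Tuple[float, float],
                 crop_frac: float = 0.5) -> Box:
    """Return a crop window around an initial click guess.

    Assumption: the window covers `crop_frac` of each image dimension and is
    shifted as needed so that it stays fully inside the image.
    """
    cx, cy = center
    width, height = image_size
    crop_w, crop_h = width * crop_frac, height * crop_frac
    # Center the window on the guess, then clamp it to the image bounds.
    left = min(max(cx - crop_w / 2, 0.0), width - crop_w)
    top = min(max(cy - crop_h / 2, 0.0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def point_in_box_reward(click: Point, target: Box) -> float:
    """Hypothetical verifiable reward: 1.0 if the click hits the target box."""
    x, y = click
    left, top, right, bottom = target
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0
```

In a sketch like this, the model would first predict a coarse location on the full screenshot, re-predict on the cropped region returned by `zoom_in_crop`, and the final click would be scored automatically against the annotated element box, which is what makes the reward checkable without a learned judge.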
