Fara-7B: An Efficient Agentic Model for Computer Use

Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer Whitehead, Andrew Zhao

TL;DR

This work tackles the data scarcity barrier in training computer-use agents by introducing FaraGen, a scalable synthetic data engine that automates task proposal, solving, and verification to create high-quality, diverse web trajectories at roughly $1 per task. Leveraging this data, the authors train Fara-7B, a compact native CUA that perceives screens using only screenshots and grounds actions via predicted coordinates, enabling on-device inference without relying on brittle DOM scaffolding. Fara-7B achieves state-of-the-art results among 7B-scale models on multiple benchmarks (WebVoyager, Online-Mind2Web, WebTailBench) and remains competitive with larger frontier models, underscoring the value of scalable synthetic data for small agentic models. The paper also introduces WebTailBench, a new, real-world benchmark targeting underrepresented tasks to better measure generalization and safety in CUAs. Together, these contributions demonstrate that high-quality synthetic data can unlock practical, on-device agentic capabilities with strong performance and safer behavior in real-world web tasks.

Abstract

Progress in computer use agents (CUAs) has been constrained by the absence of large and high-quality datasets that capture how humans interact with a computer. While LLMs have thrived on abundant textual data, no comparable corpus exists for CUA trajectories. To address these gaps, we introduce FaraGen, a novel synthetic data generation system for multi-step web tasks. FaraGen can propose diverse tasks from frequently used websites, generate multiple solution attempts, and filter successful trajectories using multiple verifiers. It achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately $1 each. We use this data to train Fara-7B, a native CUA model that perceives the computer using only screenshots, executes actions via predicted coordinates, and is small enough to run on-device. We find that Fara-7B outperforms other CUA models of comparable size on benchmarks like WebVoyager, Online-Mind2Web, and WebTailBench -- our novel benchmark that better captures under-represented web tasks in pre-existing benchmarks. Furthermore, Fara-7B is competitive with much larger frontier models, illustrating key benefits of scalable data generation systems in advancing small efficient agentic models. We are making Fara-7B open-weight on Microsoft Foundry and HuggingFace, and we are releasing WebTailBench.
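
As a rough illustration of the "generate multiple solution attempts, then filter with multiple verifiers" step described in the abstract, the following is a minimal sketch and not FaraGen's actual implementation. The names `collect_verified`, `solve`, and `verifiers` are hypothetical, trajectories are modeled as plain lists of action strings, and the rule that every verifier must accept a trajectory is an assumption; the abstract does not say how verifier outputs are combined.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: a trajectory is a list of action strings.
Trajectory = List[str]
Solver = Callable[[str, int], Trajectory]      # (task, attempt index) -> trajectory
Verifier = Callable[[str, Trajectory], bool]   # (task, trajectory) -> accepted?


def collect_verified(
    task: str,
    solve: Solver,
    verifiers: Sequence[Verifier],
    attempts: int = 3,
) -> List[Trajectory]:
    """Run several solution attempts and keep the ones every verifier accepts.

    Assumption: unanimous agreement is required. Other combination rules
    (majority vote, weighted scores) are equally plausible.
    """
    kept: List[Trajectory] = []
    for attempt in range(attempts):
        trajectory = solve(task, attempt)
        if all(verify(task, trajectory) for verify in verifiers):
            kept.append(trajectory)
    return kept
```

In this sketch, the solver and verifiers are passed in as plain callables so that any agent or judging model could be substituted without changing the filtering logic.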

Paper Structure

This paper contains 34 sections, 3 equations, 9 figures, 18 tables.

Figures (9)

  • Figure 1: WebVoyager accuracy and cost of Fara-7B compared to other computer use agents (CUAs) and Set-of-Marks (SoM) Agents. Cost is computed from the number of input and output tokens each model consumes, multiplied by the price per token. Both Fara-7B and UI-TARS-1.5-7B have the same token cost, but Fara-7B completes tasks in half the steps.
  • Figure 2: FaraGen - A breakdown of our various Task Proposal workflows, emphasizing the need for seed URLs that reflect real human users' web needs. We find FaraGen to be capable of generating diverse trajectories with high throughput and reliability.
  • Figure 3: FaraGen - Distributional differences between two publicly available sources of seed URLs: Tranco and ClueWeb22. We find that ClueWeb22 is a more valuable source of task data because it contains a lower fraction of corporate landing pages, which tend to have a narrower scope of actionable tasks achievable on those pages.
  • Figure 4: FaraGen - The Task Solving pipeline is built on top of a Magentic-One multi-agent framework, with an orchestrator agent that plans and directs a Websurfer agent that can take browser actions. A set of verifier agents identifies successfully solved trajectories for use in training Fara-7B.
  • Figure 5: Fara-7B model flow: Fara is a native CUA model. It operates directly on pixel input and outputs atomic actions such as clicking, typing or scrolling. Fara can take multiple steps to accomplish a task and is trained to stop and hand back control when it reaches critical points.
  • ...and 4 more figures
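
The cost definition in the Figure 1 caption (tokens consumed multiplied by price per token) can be sketched as follows. This is an illustrative calculation only: the `StepUsage` type, the `task_cost` function, and all token counts and prices are hypothetical placeholders, not figures from the paper. The sketch also assumes separate input and output prices quoted per million tokens, a common convention that the caption does not confirm.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StepUsage:
    """Tokens consumed by one agent step (hypothetical bookkeeping type)."""
    input_tokens: int
    output_tokens: int


def task_cost(
    steps: Sequence[StepUsage],
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Total cost of a task: summed tokens times the per-token price."""
    total_in = sum(s.input_tokens for s in steps)
    total_out = sum(s.output_tokens for s in steps)
    return (
        total_in * input_price_per_million
        + total_out * output_price_per_million
    ) / 1_000_000


# Placeholder numbers: two agents with identical per-token prices and
# identical per-step usage, where one needs half as many steps.
eight_steps = [StepUsage(input_tokens=1000, output_tokens=100) for _ in range(8)]
four_steps = eight_steps[:4]

print(task_cost(eight_steps, 1.0, 2.0))
print(task_cost(four_steps, 1.0, 2.0))
```

Under these assumptions, halving the step count halves the per-task cost, which is the effect the caption attributes to Fara-7B relative to UI-TARS-1.5-7B. In practice per-step token usage varies, so the ratio need not be exactly one half.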