InSTA: Towards Internet-Scale Training For Agents

Brandon Trabucco; Gunnar Sigurdsson; Robinson Piramuthu; Ruslan Salakhutdinov

TL;DR

The paper addresses the bottleneck of human-annotated data for training web navigation agents by proposing InSTA, an internet-scale pipeline with three components: a task proposer, agents, and a language-model judge. Starting from an initial pool of 1M websites, the pipeline generates tasks for 150k safe sites, collects trajectories from LLM agents, and filters them with the judge to curate high-quality training data, yielding a multimodal dataset of 2.2M screenshots and 2.2M action traces. Trained on this data, small models such as Qwen 3 1.7B are competitive with frontier LLMs on several benchmarks: the top agent achieves a 56.9% success rate and reaches 94.7% of the performance of Gemini 2.5 Flash while being smaller and faster. The authors release code, models, and data, and point to future work including further scaling, RL-based judge optimization, and multimodal task generation to broaden the applicability and robustness of internet-scale agents.

Abstract

The predominant approach for training web navigation agents is to gather human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data is an inefficient resource. We develop a pipeline to facilitate internet-scale training for agents without laborious human annotations. In the first stage, an LLM annotates 150k sites with agentic tasks. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM filters trajectories by judging their success. Language models are powerful data curation tools, identifying harmful content with an accuracy of 97%, judging successful trajectories with an accuracy of 82.6%, and producing effective data. We train agents based on Qwen 3 1.7B that are competitive with frontier LLMs as web agents, while being smaller and faster. Our top agent reaches a success rate of 56.9%, outperforming the data collection policy Qwen 3 235B, a 235 times larger Llama 4 Maverick, and reaching 94.7% of the performance of Gemini 2.5 Flash. We are releasing code, models and data at: https://data-for-agents.github.io.
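The three stages described above (propose a task, run an agent, filter with a judge) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Trajectory`, `run_pipeline`, `propose_task`, `run_agent`, `judge_success`, the `threshold` parameter, and the toy stand-ins below are all hypothetical, and in the real pipeline each callable would be backed by an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Trajectory:
    """One agent attempt at a task on a site (hypothetical structure)."""
    site: str
    task: str
    actions: List[str] = field(default_factory=list)


def run_pipeline(
    sites: Iterable[str],
    propose_task: Callable[[str], Optional[str]],
    run_agent: Callable[[str, str], Trajectory],
    judge_success: Callable[[Trajectory], float],
    threshold: float = 0.5,  # assumed cutoff, not a value from the paper
) -> List[Trajectory]:
    """Keep only trajectories the judge scores at or above the threshold."""
    kept: List[Trajectory] = []
    for site in sites:
        # Stage 1: the proposer returns a task, or None to skip the site
        # (for example, when it considers the site unsafe).
        task = propose_task(site)
        if task is None:
            continue
        # Stage 2: an agent attempts the task and records a trajectory.
        trajectory = run_agent(site, task)
        # Stage 3: the judge estimates success; low scores are filtered out.
        if judge_success(trajectory) >= threshold:
            kept.append(trajectory)
    return kept


# Toy stand-ins for the three LLM components (assumptions for illustration).
def toy_proposer(site: str) -> Optional[str]:
    return None if "unsafe" in site else f"Find the contact page on {site}"


def toy_agent(site: str, task: str) -> Trajectory:
    return Trajectory(site, task, actions=[f"goto {site}", "click contact"])


def toy_judge(trajectory: Trajectory) -> float:
    return 1.0 if trajectory.site.endswith(".org") else 0.0


kept = run_pipeline(
    ["example.org", "unsafe.example", "example.com"],
    toy_proposer,
    toy_agent,
    toy_judge,
)
print([t.site for t in kept])
```

With these toy components, the second site is skipped by the proposer and the third is rejected by the judge, so only the first trajectory survives filtering. Passing the three stages in as callables keeps the loop independent of which model plays each role.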
