InSTA: Towards Internet-Scale Training For Agents

Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, Ruslan Salakhutdinov

TL;DR

The paper addresses the reliance on human-annotated data for training web navigation agents by proposing InSTA, an internet-scale pipeline built from a task proposer, agents, and a language-model judge. Starting from an initial 1M websites, it generates tasks for the 150k that are marked safe, collects trajectories from LLM agents, and filters them with the judge to curate training data, assembling a multimodal dataset of 2.2M screenshots and 2.2M action traces. Agents based on Qwen 3 1.7B and trained on this data are competitive with frontier LLMs on several benchmarks: the top agent reaches a 56.9% success rate, which is 94.7% of Gemini 2.5 Flash's performance, while requiring far less compute. The authors release code, models, and data, and point to future work on further scaling, RL-based judge optimization, and multimodal task generation.

Abstract

The predominant approach for training web navigation agents is to gather human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data is an inefficient resource. We develop a pipeline to facilitate internet-scale training for agents without laborious human annotations. In the first stage, an LLM annotates 150k sites with agentic tasks. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM filters trajectories by judging their success. Language models are powerful data curation tools, identifying harmful content with an accuracy of 97%, judging successful trajectories with an accuracy of 82.6%, and producing effective data. We train agents based on Qwen 3 1.7B that are competitive with frontier LLMs as web agents, while being smaller and faster. Our top agent reaches a success rate of 56.9%, outperforming the data collection policy Qwen 3 235B, a 235 times larger Llama 4 Maverick, and reaching 94.7% of the performance of Gemini 2.5 Flash. We are releasing code, models and data at: https://data-for-agents.github.io.
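
The three stages described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' released code: the names `Trajectory`, `run_pipeline`, `propose_task`, `run_agent`, and `judge_success` are hypothetical, and the three callables stand in for LLM calls that the sketch does not implement.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """One agent attempt at a task (hypothetical container, not the paper's schema)."""
    site: str
    task: str
    actions: List[str] = field(default_factory=list)


def run_pipeline(
    sites: List[str],
    propose_task: Callable[[str], Optional[str]],
    run_agent: Callable[[str, str], Trajectory],
    judge_success: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Annotate sites with tasks, collect trajectories, keep those judged successful."""
    kept: List[Trajectory] = []
    for site in sites:
        # Stage 1: the proposer returns a task, or None if it marks the site unsafe.
        task = propose_task(site)
        if task is None:
            continue
        # Stage 2: an agent attempts the task and records its actions.
        trajectory = run_agent(site, task)
        # Stage 3: a judge filters out trajectories it considers unsuccessful.
        if judge_success(trajectory):
            kept.append(trajectory)
    return kept


# Toy stand-ins for the LLM calls, for illustration only.
def toy_proposer(site: str) -> Optional[str]:
    return None if "unsafe" in site else f"Find the contact page on {site}"


def toy_agent(site: str, task: str) -> Trajectory:
    return Trajectory(site=site, task=task, actions=["goto", "click"])


def toy_judge(trajectory: Trajectory) -> bool:
    return len(trajectory.actions) > 0


curated = run_pipeline(
    ["a.example", "unsafe.example"], toy_proposer, toy_agent, toy_judge
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or browser backend; in practice each would wrap a model call.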

Paper Structure

This paper contains 38 sections, 5 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Overview of the InSTA pipeline. Our work unlocks a dynamic internet-scale environment that allows training small models to match frontier LLMs as agents, on a fraction of the budget. Starting from the top 1M sites on the internet, we annotate 150k sites with challenging agentic tasks, and release the entire pipeline, including code, models, and an official Hugging Face dataset, on our website: https://data-for-agents.github.io.
  • Figure 2: Annotating 150k live sites with agentic tasks. Starting from 1,000,000 websites, we employ a pretrained language model that marks sites as safe/unsafe for annotation, and assigns a realistic task that a hypothetical user might want to accomplish on each site. The task proposer aggressively filters out 85% of websites from the pipeline, resulting in 150k safe websites annotated with realistic tasks.
  • Figure 3: Most frequent words in our tasks. This word cloud shows the 500 most frequent words in tasks from the training set of our official Hugging Face dataset. The size of each word corresponds to its frequency in the dataset. Our tasks span diverse categories and vocabulary.
  • Figure 4: Automatic evaluation for agents with language model judges. Building on the large and diverse set of tasks generated by the pipeline, we employ pretrained language models to attempt and evaluate web navigation tasks. We dispatch language model agents to perform tasks by making calls to the Playwright API. We then employ language model judges to evaluate the trajectories.
  • Figure 5: Language models are robust evaluators. We measure the accuracy of language models for detecting successful trajectories, and find that accuracy remains stable relative to PageRank values (left plot). As models become more confident, their accuracy improves (right plot), suggesting confidence is a useful proxy for the reliability of their predictions.
  • ...and 10 more figures
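
The Figure 5 caption notes that judge accuracy improves as the judge becomes more confident. One way to check a relationship like that is to compute accuracy only over judgments above a confidence threshold. The sketch below assumes each judgment is a hypothetical `(confidence, judge_says_success, reference_label)` tuple; the function name `accuracy_above` and the sample values are illustrative and are not taken from the paper.

```python
from typing import List, Optional, Tuple

# (judge confidence in [0, 1], judge's verdict, reference label); hypothetical format.
Judgment = Tuple[float, bool, bool]


def accuracy_above(judgments: List[Judgment], threshold: float) -> Optional[float]:
    """Accuracy of the judge over judgments with confidence >= threshold.

    Returns None when no judgment meets the threshold.
    """
    selected = [(pred, label) for conf, pred, label in judgments if conf >= threshold]
    if not selected:
        return None
    correct = sum(pred == label for pred, label in selected)
    return correct / len(selected)


# Toy values chosen for illustration only.
toy_judgments: List[Judgment] = [
    (0.95, True, True),
    (0.90, False, False),
    (0.60, True, False),
    (0.55, False, False),
]

overall = accuracy_above(toy_judgments, 0.0)    # all four judgments
confident = accuracy_above(toy_judgments, 0.8)  # only the two most confident
```

Sweeping the threshold over a grid and plotting the resulting accuracies would give a curve comparable in spirit to the right-hand plot the caption describes.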