How to Train Your LLM Web Agent: A Statistical Diagnosis
Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
TL;DR
The paper addresses the challenge of training open-source LLM web agents capable of multi-turn interactions while controlling compute costs. It introduces a statistically grounded two-stage pipeline that combines off-policy SFT from a large teacher with on-policy GRPO-based RL, and uses bootstrap-based hyperparameter analysis across 1,370 configurations to robustly allocate compute. Key findings show that a hybrid SFT+RL strategy consistently outperforms either approach alone, can achieve the peak performance of pure SFT at roughly 45% less compute, and closes the gap with closed-source models on MiniWoB++ (with some remaining difficulty on WorkArena). These results offer a practical, reproducible blueprint for budget-aware open-source LLM web agents operating in complex multi-step environments, along with actionable insights into decoding, curriculum, and normalization choices that stabilize training.
Abstract
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address these challenges, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWoB++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWoB++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
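The bootstrap step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes one common procedure, in which the completed runs are resampled with replacement, the best run in each resample is identified, and the hyperparameter value of that winner is tallied. The function name bootstrap_win_rates, the "score" key, and the toy learning rates and scores are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def bootstrap_win_rates(runs, param, n_boot=2000, seed=0):
    """Estimate how often each value of `param` belongs to the best run.

    runs:  list of dicts, each holding hyperparameter values and a "score".
    param: name of the hyperparameter to analyse (hypothetical key).
    Returns a dict mapping each winning value to its share of resamples.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = Counter()
    for _ in range(n_boot):
        # Resample the runs with replacement (same size as the original set).
        sample = rng.choices(runs, k=len(runs))
        # Record the hyperparameter value of the best run in this resample.
        best = max(sample, key=lambda r: r["score"])
        wins[best[param]] += 1
    return {value: count / n_boot for value, count in wins.items()}


# Toy data (made up): two learning rates, three runs each.
runs = [
    {"lr": 1e-6, "score": 0.60},
    {"lr": 1e-6, "score": 0.58},
    {"lr": 1e-6, "score": 0.55},
    {"lr": 1e-5, "score": 0.40},
    {"lr": 1e-5, "score": 0.42},
    {"lr": 1e-5, "score": 0.38},
]

rates = bootstrap_win_rates(runs, "lr")
print(rates)
```

A value that wins in most resamples is a reasonably safe default, while win rates spread across several values suggest the sampled runs cannot separate them. Under this assumed procedure, a value that never wins is simply absent from the returned dict.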
