How to Train Your LLM Web Agent: A Statistical Diagnosis

Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia

TL;DR

The paper addresses the challenge of training open-source LLM web agents capable of multi-turn interactions while controlling compute costs. It introduces a statistically grounded two-stage pipeline that combines off-policy SFT from a large teacher with on-policy GRPO-based RL, and uses bootstrap-based hyperparameter analysis across 1,370 configurations to robustly allocate compute. Key findings show that a hybrid SFT+RL strategy consistently outperforms either approach alone, can achieve the peak performance of pure SFT at roughly 45% less compute, and closes the gap with closed-source models on MiniWoB++ (with some remaining difficulty on WorkArena). These results offer a practical, reproducible blueprint for budget-aware open-source LLM web agents operating in complex multi-step environments, along with actionable insights into decoding, curriculum, and normalization choices that stabilize training.

Abstract

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWoB++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWoB++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.

Paper Structure

This paper contains 52 sections, 14 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Compute–performance frontier on MiniWoB++ (results averaged over two seeds). The blue curve shows pure SFT on teacher demonstrations. Warm-colored curves represent hybrid runs that branch off from SFT checkpoints and continue training with RL. Transitioning to RL early pushes the Pareto frontier, achieving higher success rates for the same compute, and is the only approach able to achieve over $30\%$ improvement on both held-out goals (left) and held-out tasks (right), closing the gap between open and closed-source models. See Figure 4 for the corresponding plot with Qwen2.5 7B.
  • Figure 2: Per-task performance of SFT and SFT+RL agents on WorkArena. The Llama 3.1 8B model is initially fine-tuned for 4 epochs on trajectories from a teacher Llama 3.3 70B model. Training then continues either with additional SFT or with GRPO fine-tuning up to epoch 20. The teacher model's success rate is also shown.
  • Figure 3: Bootstrap analysis ($n=1000$ samples) of hyperparameter optimization across different SFT compute budgets on training held-out tasks. Each subplot examines a different hyperparameter at increasing SFT compute: the base instruct model (left), $+2.5\times 10^{18}$ SFT FLOPs (middle), and $+7.6\times 10^{18}$ SFT FLOPs (right). For each hyperparameter-compute combination, the top panel shows relative reward performance with error bars indicating 95% confidence intervals, while the bottom panel displays win rates, i.e., the percentage of bootstrap iterations in which each parameter value achieved maximum performance. The results show that optimal hyperparameter values shift as SFT compute increases, suggesting that hyperparameter selection should be adapted to the computational budget allocated to SFT.
  • Figure 4: Compute–performance frontier on MiniWoB++ (results averaged over two seeds) for Qwen2.5 7B. The blue curve shows pure SFT on teacher demonstrations. Warm-colored curves represent hybrid runs that branch off from SFT checkpoints and continue training with RL. Transitioning to RL early pushes the Pareto frontier, achieving higher success rates for the same compute, and is the only approach able to achieve over $30\%$ improvement on both held-out goals (left) and held-out tasks (right).
  • Figure 5: Per-task performance of SFT and SFT+RL agents on MiniWoB++.
  • ...and 4 more figures
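
The bootstrap win-rate analysis described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `bootstrap_win_rates`, the `(value, score)` run format, and the choice to score each hyperparameter value by its best resampled run are assumptions made for illustration. The sketch resamples runs with replacement, records which value attains the highest score in each iteration, and reports a 95% percentile interval per value.

```python
import random


def bootstrap_win_rates(runs, n_boot=1000, seed=0):
    """Bootstrap win rates for one hyperparameter (illustrative sketch).

    runs: list of (hyperparameter_value, score) pairs from a random search.
    Returns (win_rate, ci):
      win_rate[v] = fraction of bootstrap iterations in which v had the
                    highest best-run score (ties go to the value seen first);
      ci[v]       = 95% percentile interval of v's best-run score.
    Assumption: each value is scored by its best run in the resample.
    """
    rng = random.Random(seed)
    values = sorted({v for v, _ in runs})
    wins = {v: 0 for v in values}
    best_scores = {v: [] for v in values}

    for _ in range(n_boot):
        # Resample the full set of runs with replacement.
        sample = rng.choices(runs, k=len(runs))
        best = {}
        for v, s in sample:
            best[v] = max(s, best.get(v, float("-inf")))
        for v, s in best.items():
            best_scores[v].append(s)
        wins[max(best, key=best.get)] += 1

    win_rate = {v: wins[v] / n_boot for v in values}
    ci = {}
    for v in values:
        xs = sorted(best_scores[v])
        if not xs:  # value never appeared in any resample
            ci[v] = (None, None)
            continue
        lo = xs[int(0.025 * (len(xs) - 1))]
        hi = xs[int(0.975 * (len(xs) - 1))]
        ci[v] = (lo, hi)
    return win_rate, ci


if __name__ == "__main__":
    # Hypothetical data: two values of one hyperparameter, three runs each.
    demo_runs = [(0.1, 0.20), (0.1, 0.25), (0.1, 0.30),
                 (1.0, 0.60), (1.0, 0.65), (1.0, 0.70)]
    rates, intervals = bootstrap_win_rates(demo_runs)
    for value in rates:
        print(value, rates[value], intervals[value])
```

Scoring each value by its best resampled run reflects what a practitioner would pick after a sweep; replacing `max` with a mean would instead estimate typical rather than best-case performance.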