OpAgent: Operator Agent for Web Navigation

Yuyu Guo, Wenjie Yang, Siyuan Yang, Ziyang Liu, Cheng Chen, Yuan Wei, Yun Hu, Yang Huang, Guoliang Hao, Dongsheng Yuan, Jianming Wang, Xin Chen, Hang Yu, Lei Lei, Peng Di

TL;DR

OpAgent addresses autonomous web navigation by learning online, through direct interaction with unconstrained websites, rather than from static offline data. It combines a Vision-Language Model (VLM) foundation trained via Hierarchical Multi-Task SFT, an Online Agentic RL loop with a Hybrid Reward (WebJudge + RDTree), and a modular Operator Agent (Planner/Grounder/Reflector/Summarizer) that performs long-horizon tasks with robust error recovery. The RL-enhanced model reaches a pass@5 success rate of $38.1\%$ on WebArena, and the full agentic framework, whose collaborating components improve reliability, sets a new SOTA success rate of $71.6\%$. Overall, the results indicate a practical, scalable path toward robust real-world web automation and deployment.

Abstract

To fulfill user instructions, autonomous web agents must contend with the inherent complexity and volatility of real-world websites. Conventional paradigms predominantly rely on Supervised Fine-Tuning (SFT) or Offline Reinforcement Learning (RL) using static datasets. However, these methods suffer from severe distributional shift, because offline trajectories fail to capture the stochastic state transitions and real-time feedback of unconstrained open-web environments. In this paper, we propose a robust Online Reinforcement Learning WebAgent, designed to optimize its policy through direct, iterative interaction with unconstrained live websites. Our approach comprises three core innovations: 1) Hierarchical Multi-Task Fine-tuning: We curate a comprehensive mixture of datasets categorized by functional primitives (Planning, Acting, and Grounding), establishing a Vision-Language Model (VLM) with strong instruction-following capabilities for Web GUI tasks. 2) Online Agentic RL in the Wild: We develop an online interaction environment and fine-tune the VLM using a specialized RL pipeline. We introduce a Hybrid Reward Mechanism that combines a ground-truth-agnostic WebJudge for holistic outcome assessment with a Rule-based Decision Tree (RDT) for progress reward. This system effectively mitigates the credit assignment challenge in long-horizon navigation. Notably, our RL-enhanced model achieves a 38.1% success rate (pass@5) on WebArena, outperforming all existing monolithic baselines. 3) Operator Agent: We introduce a modular agentic framework, namely OpAgent, orchestrating a Planner, Grounder, Reflector, and Summarizer. This synergy enables robust error recovery and self-correction, elevating the agent's performance to a new State-of-the-Art (SOTA) success rate of 71.6%.
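The Hybrid Reward Mechanism described in the abstract can be illustrated with a small sketch. The abstract does not give the combination formula, so everything below is an assumption made for illustration: the additive form, the `progress_weight` parameter, the `Step` record, and the modelling of the rule-based decision tree as an ordered list of checkpoint predicates. The outcome score stands in for whatever a WebJudge-style evaluator would return.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a trajectory (hypothetical record, not the paper's schema)."""
    url: str
    action: str


# Hypothetical stand-in for a rule-based decision tree: an ordered list of
# checkpoint predicates, each evaluated on a single step.
Rule = Callable[[Step], bool]


def progress_reward(trajectory: List[Step], rules: List[Rule]) -> float:
    """Fraction of checkpoints satisfied, in order, along the trajectory."""
    next_rule = 0
    for step in trajectory:
        # Each step can advance at most one checkpoint.
        if next_rule < len(rules) and rules[next_rule](step):
            next_rule += 1
    return next_rule / len(rules) if rules else 0.0


def hybrid_reward(
    outcome_score: float,
    trajectory: List[Step],
    rules: List[Rule],
    progress_weight: float = 0.3,  # assumed value, not from the paper
) -> float:
    """Assumed additive mix of a holistic outcome score and a progress reward."""
    return outcome_score + progress_weight * progress_reward(trajectory, rules)
```

With a mix of this kind, a trajectory that fails overall but passes some intermediate checkpoints still receives a non-zero signal, which is the intuition behind using a progress reward to ease credit assignment on long-horizon tasks.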

Paper Structure (21 sections, 4 equations, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: Our proposed OpAgent achieves a new state-of-the-art (SOTA) success rate of 71.6% on the WebArena benchmark.
  • Figure 2: Overall architecture and training pipeline of OpAgent. (Top) The system facilitates a multi-turn interaction loop where the Operator Agent executes actions and receives observations from live websites to fulfill user queries. (Bottom-Left) The development of the agent follows a hierarchical post-training paradigm: MT-SFT on offline data to establish foundational capabilities, followed by RL in the Wild for adaptive policy optimization in real-world environments. The agentic framework orchestrates modular roles including Planner, Grounder, Reflector and Summarizer. (Bottom-Right) A sample trajectory demonstrates the step-wise execution of a complex refund request task.
  • Figure 3: Illustration of the Hierarchical Multi-Task Supervised Fine-tuning (MT-SFT) pipeline. We initialize the VLM by joint training on a diverse mixture of self-collected data, categorized into three functional primitives: (1) Planning (via WebDreamer) for high-level goal decomposition and state prediction; (2) Acting (via Mind2Web and Aguvis) for low-level action execution; and (3) Grounding (via UGround) for spatial element localization. A Task-specific Effective Weighting strategy is employed to balance the learning gradients across these heterogeneous tasks, ensuring a robust foundational policy for subsequent RL optimization.
  • Figure 4: Hierarchical Infrastructure for the Web Agent RL. The system is organized into four functional layers: (1) the Environment Layer featuring a hybrid sandbox consisting of the open Wild Web and a self-hosted WebArena on Alibaba Cloud ECS; (2) the Infrastructure Layer managing a distributed browser cluster for scalable data collection; (3) the Execution Layer utilizing a high-concurrency Playwright engine to translate semantic actions into API commands; and (4) the Decision Layer where the VLM-based agent performs reasoning and action generation. The solid arrows (left) denote the upward Action Flow, while the dashed arrows (right) represent the downward Observation Flow of multimodal feedback.
  • Figure 5: Overview of the OpAgent Training Infrastructure and Reinforcement Learning Loop. The framework consists of three core phases: (1) Task Generation, where a Query-Agent synthesizes realistic navigation goals on filtered top URLs; (2) Interactive Rollout, where the VLM-based agent interacts with a hybrid environment (including WeaveFox, unconstrained Wild Web sites, and the self-hosted WebArena) via a high-concurrency Playwright engine; and (3) Hybrid Reward Evaluation. The reward system integrates an RDTree (Rule-based Decision Tree) to derive process-based rewards for intermediate steps, and WebJudge, which assesses visual trajectory screenshots to provide a holistic success score.
  • ...and 4 more figures
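
The multi-turn loop in the Figure 2 caption, where modular roles cooperate while the agent acts on a site and receives observations, could be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the division of responsibilities between Planner, Grounder, Reflector and Summarizer, the function signatures, the `"DONE"` sentinel, and `max_steps` are all assumptions, and each role is passed in as a plain callable so the sketch runs without any model or browser.

```python
from typing import Callable, List, Tuple


def run_operator_agent(
    query: str,
    observe: Callable[[], str],            # returns the current page observation
    act: Callable[[str], None],            # executes a concrete action
    planner: Callable[[str, str, List[str]], str],
    grounder: Callable[[str, str], str],
    reflector: Callable[[str, str, str], Tuple[bool, str]],
    summarizer: Callable[[str, List[str]], str],
    max_steps: int = 10,                   # assumed step budget
) -> str:
    """Hypothetical plan -> ground -> act -> reflect loop with a final summary."""
    notes: List[str] = []
    for _ in range(max_steps):
        before = observe()
        # Planner picks the next sub-goal, seeing earlier reflection notes.
        subgoal = planner(query, before, notes)
        if subgoal == "DONE":
            break
        # Grounder maps the sub-goal to a concrete action on the current page.
        action = grounder(subgoal, before)
        act(action)
        # Reflector compares observations before and after the action.
        ok, feedback = reflector(subgoal, before, observe())
        notes.append(f"{action}: {'ok' if ok else 'failed'} ({feedback})")
    # Summarizer turns the accumulated notes into the final answer.
    return summarizer(query, notes)
```

Feeding the Reflector's notes back into the Planner is one simple way to model the error recovery that the caption attributes to the modular design; a failed step becomes visible to the next planning call.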