Building Browser Agents: Architecture, Security, and Practical Solutions
Aram Vardanyan
TL;DR
The paper argues that deploying autonomous browser agents safely in production hinges on architecture rather than on LLM capability alone. It introduces a hybrid context management approach (accessibility-tree snapshots plus selective vision), a robust execution layer with a stable element-ref system and bulk actions, and deterministic safety boundaries that enforce specialization and domain allowlisting. Validation of the production agent on the WebGames benchmark shows ~85% success across 53 challenges, well above the ~50% reported for prior browser agents and approaching the 95.7% human baseline, while revealing remaining gaps in advanced vision, real-time interaction, and precision control. The work emphasizes specialization, memory-efficient context handling, and programmatic safety as the practical path toward reliable, secure autonomous web automation. The findings suggest that future progress will come from architectural discipline and domain-specific configurations rather than from chasing universal general intelligence, with significant implications for security, cost, and real-world applicability.
Abstract
Browser agents enable autonomous web interaction but face critical reliability and security challenges in production. This paper presents findings from building and operating a production browser agent. The analysis examines where current approaches fail and what prevents safe autonomous operation. The fundamental insight: model capability does not limit agent performance; architectural decisions determine success or failure. Security analysis of real-world incidents reveals that prompt injection attacks make general-purpose autonomous operation fundamentally unsafe. The paper argues against developing general browsing intelligence and in favor of specialized tools with programmatic constraints, where safety boundaries are enforced through code instead of large language model (LLM) reasoning. Through hybrid context management combining accessibility tree snapshots with selective vision, comprehensive browser tooling matching human interaction capabilities, and intelligent prompt engineering, the agent achieved an approximately 85% success rate on the WebGames benchmark across 53 diverse challenges (compared to approximately 50% reported for prior browser agents and a 95.7% human baseline).
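To make the idea of a safety boundary "enforced through code" concrete, the following is a minimal sketch of a domain allowlist check that would run before any navigation, independent of what the model decides. It is an illustration under stated assumptions, not the paper's implementation: the function name `is_navigation_allowed`, the `ALLOWED_DOMAINS` set, and the example domains are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for one specialized agent configuration (assumption).
ALLOWED_DOMAINS = frozenset({"example.com", "docs.example.org"})


def is_navigation_allowed(url: str, allowed_domains=ALLOWED_DOMAINS) -> bool:
    """Return True only for http(s) URLs on an allowlisted domain or its subdomains.

    The check is deterministic: it never consults the model, so text injected
    into a page cannot talk the agent into visiting another site.
    """
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower().rstrip(".")
    except ValueError:
        # Malformed URLs (for example, a broken IPv6 literal) are rejected.
        return False

    # Reject javascript:, file:, data:, and other non-web schemes.
    if parts.scheme not in ("http", "https"):
        return False
    if not host:
        return False

    # Exact match, or a true subdomain. The leading dot prevents
    # "evilexample.com" from matching "example.com".
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

In this sketch the agent's navigation tool would call the check and refuse when it returns False, so a look-alike host such as `example.com.evil.test` or a userinfo trick such as `https://example.com@evil.test/` is blocked regardless of how the request was phrased to the model.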
