Building Browser Agents: Architecture, Security, and Practical Solutions

Aram Vardanyan

TL;DR

The paper argues that deploying autonomous browser agents safely in production hinges on architecture rather than LLM capability. It introduces a hybrid context management approach (accessibility-tree snapshots plus selective vision), an execution layer with a stable element-ref system and bulk actions, and deterministic safety boundaries that enforce specialization and domain allowlisting. Production validation on the WebGames benchmark shows approximately 85% success across 53 challenges, compared with approximately 50% reported for prior browser agents and a 95.7% human baseline, while revealing remaining gaps in advanced vision, real-time interaction, and precision control. The work emphasizes specialization, memory-efficient context handling, and programmatic safety as the practical path to reliable, secure web automation. The findings suggest that future progress will come from architectural discipline and domain-specific configurations rather than from pursuing general browsing intelligence, with implications for security, cost, and real-world applicability.

Abstract

Browser agents enable autonomous web interaction but face critical reliability and security challenges in production. This paper presents findings from building and operating a production browser agent. The analysis examines where current approaches fail and what prevents safe autonomous operation. The fundamental insight: model capability does not limit agent performance; architectural decisions determine success or failure. Security analysis of real-world incidents reveals prompt injection attacks make general-purpose autonomous operation fundamentally unsafe. The paper argues against developing general browsing intelligence in favor of specialized tools with programmatic constraints, where safety boundaries are enforced through code instead of large language model (LLM) reasoning. Through hybrid context management combining accessibility tree snapshots with selective vision, comprehensive browser tooling matching human interaction capabilities, and intelligent prompt engineering, the agent achieved approximately 85% success rate on the WebGames benchmark across 53 diverse challenges (compared to approximately 50% reported for prior browser agents and 95.7% human baseline).
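To make "safety boundaries enforced through code" concrete, here is a minimal sketch of a domain allowlist check that runs before any navigation, independent of what the LLM decides. The function name `is_allowed` and the example domains are assumptions for illustration; this section does not show the paper's actual implementation.

```python
from typing import AbstractSet
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_domains: AbstractSet[str]) -> bool:
    """Return True only if url is http(s) and its host is an allowed
    domain or a subdomain of one. Hypothetical helper, not the paper's API."""
    parts = urlsplit(url)
    # Reject non-web schemes such as javascript:, file:, or data:.
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower().rstrip(".")
    if not host:
        return False
    # Match the exact domain or a dot-separated subdomain, so that
    # "evilexample.com" does not pass for "example.com".
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical allowlist for a single-purpose agent.
ALLOWED = frozenset({"example.com"})
```

Because the check is deterministic, injected page text cannot talk the agent past it: a navigation tool would call `is_allowed(target, ALLOWED)` and refuse the action when it returns `False`.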

Paper Structure

This paper contains 49 sections and 22 figures.

Figures (22)

  • Figure 1: Date picker examples with dense clickable elements. Typical date picker interfaces where each date cell is only 24px, creating challenges for vision-based approaches to accurately identify and click specific dates.
  • Figure 2: Annotated screenshot approach for simple interfaces. Clickable elements are labeled with numbers, allowing the LLM to reference them directly (for example, "click button 5"). This works well when there are relatively few clickable elements.
  • Figure 3: Annotation approach fails with dense interfaces. When attempting to annotate a date picker with many small clickable elements (24px each), the overlay becomes cluttered and confusing, making it difficult for the LLM to understand and reference specific elements.
  • Figure 4: HTML markup for a simple form field. A simple text input labeled "Total Weight (kg)" requires over 100 lines of complex HTML with nested divs, ARIA attributes, and decorative elements, far too verbose to pass efficiently to an LLM.
  • Figure 5: Chrome's Accessibility Tree representation. The browser's accessibility API provides a structured view of the page, exposing semantic information about elements without the verbose HTML markup.
  • ...and 17 more figures
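
The contrast drawn in Figures 4 and 5, verbose HTML versus a compact accessibility tree, can be sketched as a small serializer. The node shape (`role`, `name`, `children`), the set of interactive roles, and the `e1`, `e2` ref format are all assumptions for illustration; the refs here are only meaningful within one snapshot, and the paper's own ref scheme is not shown in this section.

```python
from typing import Dict, List, Tuple

# Assumed set of roles an agent can act on; a real agent would likely use more.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def snapshot(root: dict) -> Tuple[str, Dict[str, dict]]:
    """Flatten a hypothetical accessibility-tree dict into indented text
    and a ref -> node map for the interactive elements."""
    lines: List[str] = []
    refs: Dict[str, dict] = {}

    def visit(node: dict, depth: int) -> None:
        role = node.get("role", "generic")
        name = node.get("name", "")
        children = node.get("children", [])
        # Unnamed generic containers carry no meaning: skip the node
        # itself but keep its children at the same depth.
        if role == "generic" and not name:
            for child in children:
                visit(child, depth)
            return
        label = f'{role} "{name}"' if name else role
        if role in INTERACTIVE_ROLES:
            ref = f"e{len(refs) + 1}"
            refs[ref] = node
            label += f" [ref={ref}]"
        lines.append("  " * depth + "- " + label)
        for child in children:
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(lines), refs
```

For a form whose "Total Weight (kg)" text input sits inside an unnamed wrapper, the snapshot keeps one line for the input and drops the wrapper, and the LLM can then name a target by ref (for example `e1`) instead of by pixel coordinates.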