FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste

TL;DR

FocusAgent tackles the challenge of extremely long web observations by applying a lightweight LLM retriever to prune AxTree content according to the task goal. The two-stage pipeline preserves essential planning information while dramatically reducing observation size, enabling efficient and robust web reasoning. Empirical results on WorkArena and WebArena show FocusAgent matching strong baselines with over 50% observation reduction; a security-focused variant also markedly lowers prompt-injection attack success while maintaining attack-free performance. The work highlights targeted LLM-based retrieval as a practical approach for building efficient, effective, and safer web agents, with open-source implementations and clear avenues for further improvements in prompting and attack mitigation.

Abstract

Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.

Paper Structure

This paper contains 50 sections, 5 equations, 32 figures, 7 tables.

Figures (32)

  • Figure 1: Overview of FocusAgent pipeline with and without prompt injection attacks. The first stage is for retrieving relevant lines from the observation, including removing prompt injections if present. The second stage uses the pruned observation to predict actions to complete the task goal.
  • Figure 1: Success Rates (SR) and Standard Error ($\pm$SE) of agents leveraging different retrieval methods on WorkArena L1 using GPT-4.1 as the backbone model for all agents and GPT-4.1-mini for the retriever of FocusAgent. We report the average pruning (Prun.) the method achieves on the benchmark.
  • Figure 2: Illustration of the operation of FocusAgent’s retrieval component for the task “Upvote the newest post in the deeplearning subreddit” at step 2 on WebArena (task ID 407). The retrieval procedure consists of three stages: (1) line numbers are systematically assigned to each element of the AxTree, after which a prompt is constructed incorporating the task objective and, where applicable, the interaction history; (2) the LLM generates a Chain-of-Thought (CoT) together with the line ranges identified as relevant to task completion; and (3) a revised AxTree is produced by removing irrelevant lines and inserting a placeholder that specifies the number of lines pruned.
  • Figure 3: SR vs average pruning across agents. For cost efficiency, pruning should remove at least 20% of the AxTree tokens while maintaining performance close to using the full tree.
  • Figure 4: Original vs Pruned tokens of AxTrees for FocusAgent(4.1-mini) with GPT-4.1 as backbone on benchmarks. Both figures show the pruning distribution of step-wise AxTrees.
  • ...and 27 more figures
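
The three retrieval stages described in the Figure 2 caption (numbering the AxTree lines, collecting the line ranges the LLM marks as relevant, and rebuilding the tree with a placeholder for the removed lines) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the "2-3, 5" range format, and the placeholder wording are all assumptions, and the LLM call itself is left out and replaced by a hand-written range string.

```python
import re


def number_lines(axtree: str) -> str:
    """Stage 1 (assumed format): prefix each AxTree line with its 1-based number."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(axtree.splitlines(), start=1)
    )


def parse_ranges(spec: str, n_lines: int) -> set[int]:
    """Stage 2 (assumed format): turn a reply such as '2-3, 5' into line numbers.

    Numbers outside 1..n_lines are ignored.
    """
    keep: set[int] = set()
    for match in re.finditer(r"(\d+)\s*(?:-\s*(\d+))?", spec):
        start = int(match.group(1))
        end = int(match.group(2) or start)
        keep.update(range(max(start, 1), min(end, n_lines) + 1))
    return keep


def prune(axtree: str, keep: set[int]) -> str:
    """Stage 3: drop unselected lines and note how many were removed in each gap."""
    out: list[str] = []
    skipped = 0
    for i, line in enumerate(axtree.splitlines(), start=1):
        if i in keep:
            if skipped:
                out.append(f"[{skipped} line(s) pruned]")
                skipped = 0
            out.append(line)
        else:
            skipped += 1
    if skipped:
        out.append(f"[{skipped} line(s) pruned]")
    return "\n".join(out)


# Hypothetical AxTree fragment; a real observation would be far longer.
axtree = "\n".join(
    [
        "banner 'Site header'",
        "heading 'deeplearning'",
        "link 'Newest post'",
        "footer 'Terms of use'",
        "button 'Upvote'",
    ]
)
llm_reply = "2-3, 5"  # stand-in for the retriever's answer
keep = parse_ranges(llm_reply, n_lines=len(axtree.splitlines()))
pruned = prune(axtree, keep)
```

Keeping a per-gap count in the placeholder tells the downstream agent that content was removed at that position without spending tokens on the content itself.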