Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents

Dongjun Lee, Juyong Lee, Kyuyoung Kim, Jihoon Tack, Jinwoo Shin, Yee Whye Teh, Kimin Lee

TL;DR

The paper tackles the difficulty of LLM-based web agents in processing complex web page observations. It introduces LCoW, a contextualization module that translates raw observations into concise, task-grounded representations, and trains it via an iterative, reward-based procedure using multiple LLMs. Through extensive experiments on WebShop, WorkArena, and WebArena, LCoW delivers consistent performance gains across both closed- and open-source models, achieving state-of-the-art on WebShop with Gemini-1.5-flash and demonstrating strong generalization to unseen task types and models. The work also analyzes the nature of the contextualized observations and discusses limitations and future directions, suggesting scalability and efficiency improvements for broader applicability.

Abstract

Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into a comprehensible format, which is then utilized by the decision-making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.

Paper Structure

This paper contains 46 sections, 2 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Success rate of the Gemini-1.5-flash agent on 40 WorkArena tasks. We selected a subset of 40 tasks by choosing the first 40 tasks based on the task indices. When the agent leverages observations contextualized by GPT-4o (yellow), its success rate improves by 31%, with further improvements achieved with our method (green).
  • Figure 2: (Top) In the conventional pipeline, LLM agents decide on the next action based on raw, complex web page observations (e.g., HTML, accessibility trees), which often hinder accurate decision making. (Bottom) In our proposed pipeline, a contextualization module transforms these complex web page observations into a more comprehensible format, thereby enabling LLM agents to make more accurate decisions by enhancing their understanding of the web page.
  • Figure 3: An example of an input to the contextualization module, including a lengthy web page observation (left), and an observation contextualized by the contextualization module trained using LCoW (right). The module converts raw observations into a more concise form to enhance decision making in agents. The prompt used is provided in Appendix \ref{ss:prompt}.
  • Figure 4: Illustration of sampling optimal contextualization. First, we sample multiple candidates of contextualized observations, given user instruction [TASK], previous actions $a_{<t}$, and observation $o_t$. Subsequently, multiple LLM agents predict the next action based on each candidate, and the reward for each candidate is computed according to how many LLM agents correctly predict the ground-truth action $a_t$.
  • Figure 5: Success rate on 500 evaluation tasks from WebShop. Average human performance and expert human performance are 50% and 59.6%, respectively [yao2022webshop]. The Gemini-1.5-flash agent with the contextualization module trained for three iterations achieves a state-of-the-art success rate of 62.8%, outperforming the human expert performance, as well as previous baselines [yao2022react, furuta2023multimodal, putta2024agent, sridhar2023ash, ma2024laserllmagentstatespace, gur2023understandinghtmllargelanguage].
  • ...and 9 more figures
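
The candidate-scoring step described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Agent`, `candidate_reward`, and `select_best_candidate` are hypothetical, each agent is modeled as a plain callable standing in for an LLM call, and an action counts as "correct" only on an exact string match with the ground-truth action $a_t$ (the paper may use a different matching rule).

```python
from collections.abc import Callable, Sequence

# Hypothetical agent interface (an assumption for this sketch):
# (task, previous_actions, contextualized_observation) -> predicted next action
Agent = Callable[[str, Sequence[str], str], str]


def candidate_reward(
    candidate: str,
    task: str,
    previous_actions: Sequence[str],
    ground_truth_action: str,
    agents: Sequence[Agent],
) -> int:
    """Count how many agents predict the ground-truth action from this candidate."""
    target = ground_truth_action.strip()
    return sum(
        1
        for agent in agents
        # Exact string match is an assumption; it is the simplest notion of "correct".
        if agent(task, previous_actions, candidate).strip() == target
    )


def select_best_candidate(
    candidates: Sequence[str],
    task: str,
    previous_actions: Sequence[str],
    ground_truth_action: str,
    agents: Sequence[Agent],
) -> tuple[str, list[int]]:
    """Score every candidate and return the highest-reward one with all rewards."""
    rewards = [
        candidate_reward(c, task, previous_actions, ground_truth_action, agents)
        for c in candidates
    ]
    # max() keeps the first index on ties, so earlier candidates win ties.
    best_index = max(range(len(candidates)), key=rewards.__getitem__)
    return candidates[best_index], rewards


# Toy stand-ins for LLM agents, used only to exercise the functions above.
def agent_a(task: str, previous_actions: Sequence[str], obs: str) -> str:
    return "click(12)" if "12" in obs else "noop()"


def agent_b(task: str, previous_actions: Sequence[str], obs: str) -> str:
    return "click(12)" if "Submit" in obs else "scroll()"


def agent_c(task: str, previous_actions: Sequence[str], obs: str) -> str:
    return "click(12)"


demo_candidates = [
    "The Submit button has id 12.",
    "The page shows a long form.",
]
demo_best, demo_rewards = select_best_candidate(
    demo_candidates,
    task="Submit the form",
    previous_actions=[],
    ground_truth_action="click(12)",
    agents=[agent_a, agent_b, agent_c],
)
```

In this toy run the first candidate names the relevant element, so all three stand-in agents recover the ground-truth action from it, while the second candidate only succeeds with the agent that ignores its input. In the setting the caption describes, the candidates would be sampled from the contextualization module and the agents would be different LLMs.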