From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents

Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-Tür

TL;DR

This work studies how context management, specifically interaction history and web page representation, affects multi-turn web navigation agents. It shows that effective context management improves agent performance in out-of-distribution scenarios, including unseen websites, categories, and geographic locations.

Abstract

Recent advancements in Large Language Model (LLM)-based frameworks have extended their capabilities to complex real-world applications, such as interactive web navigation. These systems, driven by user commands, navigate web browsers to complete tasks through multi-turn dialogues, offering both innovative opportunities and significant challenges. Despite the introduction of benchmarks for conversational web navigation, a detailed understanding of the key contextual components that influence the performance of these agents remains elusive. This study aims to fill this gap by analyzing the various contextual elements crucial to the functioning of web navigation agents. We investigate the optimization of context management, focusing on the influence of interaction history and web page representation. Our work highlights improved agent performance across out-of-distribution scenarios, including unseen websites, categories, and geographic locations through effective context management. These findings provide insights into the design and optimization of LLM-based agents, enabling more accurate and effective web navigation in real-world applications.

Paper Structure

This paper contains 21 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Dense Markup Ranking (DMR). The web page's HTML is parsed into a Document Object Model (DOM), and both the candidate elements and the current state are encoded into vectors. The cosine similarity between the state and each candidate element is computed, and the elements are ranked by relevance to the user's query to support informed navigation and interaction decisions.
  • Figure 2: The figure depicts the interaction between a user and a web agent, illustrating each turn. The user's and agent's utterances are displayed in blue speech bubbles on the left and right, while the agent's actions are shown in red on the right. The red box shows a single agent action along with the arguments associated with each action, such as the URL to load. The agent's internal state, including its understanding and actions at each point (timestamped as "t"), is highlighted in yellow.
  • Figure 3: Recall@10 performance using MiniLM across interaction history lengths of 5, 10, and 15 turns on the four Test-OOD splits. The plot highlights the importance of balancing the length of the interaction history to enhance the model's ability to select relevant candidates in different test scenarios.
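
The ranking step described in the Figure 1 caption and the Recall@10 metric named in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the vectors below are hand-written toy values standing in for encoder outputs (the paper uses MiniLM, which is not called here), the element names and the function names `rank_candidates` and `recall_at_k` are hypothetical, and the recall definition assumes a single gold element per turn.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_candidates(state_vec, element_vecs):
    """Return (element_id, score) pairs sorted by descending similarity."""
    scored = [
        (element_id, cosine_similarity(state_vec, vec))
        for element_id, vec in element_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def recall_at_k(ranked_ids, target_id, k=10):
    """1.0 if the gold element is among the top-k candidates, else 0.0.

    Assumption: one gold element per turn; a dataset-level Recall@k
    would average this value over all turns.
    """
    return 1.0 if target_id in ranked_ids[:k] else 0.0


# Toy stand-ins for encoded state and DOM elements (assumed values).
state = [1.0, 0.0, 1.0]
elements = {
    "search_box": [0.9, 0.1, 0.8],
    "footer_link": [0.0, 1.0, 0.0],
    "login_button": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(state, elements)
ranked_ids = [element_id for element_id, _ in ranking]
print(ranked_ids)
print(recall_at_k(ranked_ids, "login_button", k=2))
```

In this toy setup the state vector would encode the user's query together with some number of previous turns, so changing the history length changes the state vector and therefore the ranking.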