WEPO: Web Element Preference Optimization for LLM-based Web Navigation

Jiarun Liu, Jia Hao, Chunhong Zhang, Zheng Hu

TL;DR

This work improves autonomous web navigation by exploiting HTML structure through preference learning. It proposes WEPO, a framework that samples non-salient HTML elements as negative examples without supervision and trains with Direct Preference Optimization (DPO) to align model actions with user intent. On the Mind2Web benchmark, WEPO achieves state-of-the-art performance, improving on WebAgent by 13.8% and on CogAgent by 5.3%, and generalizes well across domains and tasks. The results indicate that contrastive, preference-based fine-tuning can substantially improve web-page-based task execution. Promising future directions include HTML-specific encoders and scaling to longer contexts.

Abstract

The rapid advancement of autonomous web navigation has significantly benefited from grounding pretrained Large Language Models (LLMs) as agents. However, current research has yet to fully leverage the redundancy of HTML elements for contrastive training. This paper introduces a novel approach to LLM-based web navigation tasks, called Web Element Preference Optimization (WEPO). WEPO uses unsupervised preference learning: it samples distance-based non-salient web elements as negative samples and optimizes the maximum likelihood objective of Direct Preference Optimization (DPO). We evaluate WEPO on the Mind2Web benchmark and empirically demonstrate that WEPO aligns the user's high-level intent with output actions more effectively. The results show that our method achieves state-of-the-art performance, with an improvement of 13.8% over WebAgent and 5.3% over the visual language model CogAgent baseline. Our findings underscore the potential of preference optimization to enhance web navigation and other web-page-based tasks, suggesting a promising direction for future research.
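
As a rough illustration of the DPO objective mentioned in the abstract, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. It is a minimal sketch, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions introduced here. In the WEPO setting, the "chosen" response would correspond to the action on the correct element and the "rejected" response to an action on a sampled negative element.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss (names and beta default are illustrative).

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    where y_w is the chosen response and y_l the rejected one.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy prefers the chosen response more than the reference does: lower loss.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss falls as the policy raises the likelihood of the chosen action relative to the reference model and lowers that of the rejected one, which is the contrastive signal the abstract refers to.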

Paper Structure

This paper contains 10 sections, 1 equation, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of Web Element Preference Optimization (WEPO). Given the user intent “Find me a M2 Mac Air Laptop with 15" screen”, WEPO pairs the correct element (marked in green) with negative elements sampled by a heuristic rule (marked in red) to construct preference pairs. These pairs are used with the maximum likelihood objective proposed in algorithms such as DPO to fine-tune the language model, improving its accuracy in element discrimination and selection.
  • Figure 2: Statistical distribution of Element Distance for different models (Llama3-8B, Mistral-7B and Gemma-2B) on the test dataset. As the model size increases, the relative deviation in element distances decreases.
  • Figure 3: Ablation studies on the negative ratio. We evaluated the Llama3-8B-WEPO model at $n = 1, 3, 5$ and calculated the average SSR (%) on three cross-test sets, which were $57.7\%$, $63.5\%$ for distance-based sampling and $56.6\%$, $61.1\%$, and $62.4\%$ for random sampling, respectively. An elbow point was observed at $n = 3$ for random sampling, after which the gain in SSR levels off sharply. Furthermore, distance-based sampling at a 1:3 ratio already surpasses random sampling at a 1:5 ratio by 1.1 percentage points.
  • Figure 4: A collection of correct and incorrect actions generated by WEPO models. The correct cases are within the green boxes on the left, and the incorrect cases are within the red boxes on the right. A generated action counts as correct only if its element ID matches the reference for CLICK; for TYPE and SELECT, the textual value must also be identical. Common causes of error are therefore incorrect element localization and mismatched textual content.
  • Figure 5: Statistical visualization of HTML snippet lengths and predefined action proportions in the Mind2Web dataset. We randomly selected eight subsets for display. The red dashed line represents the upper limit of the context window length for mainstream open-source LLMs at 128k tokens, while the gray dashed line indicates the average token length of HTML snippets. In the proportional bar charts on the right, purple corresponds to CLICK, green to TYPE, and yellow to SELECT.
  • ...and 2 more figures
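
The captions for Figures 1, 3, and 4 describe two mechanics that a short sketch may make more concrete: building 1:n preference pairs from a correct element plus sampled negatives, and judging whether a generated action is correct. The code below is a hypothetical illustration only. The names `Action`, `is_correct`, `sample_negatives`, and `build_pairs` are invented here, and the "distance" strategy assumes that distance means the index gap between elements in a flattened candidate list with the nearest elements chosen first; the captions do not define the metric, so the paper's actual rule may differ. Requiring the operation type to match is likewise an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    op: str           # "CLICK", "TYPE", or "SELECT"
    element_id: str
    value: str = ""   # text for TYPE / SELECT; unused for CLICK


def is_correct(pred: Action, gold: Action) -> bool:
    """Element ID must match; TYPE and SELECT also need an identical value.

    Requiring the same op is an assumption not stated in the caption.
    """
    if pred.op != gold.op or pred.element_id != gold.element_id:
        return False
    if gold.op in ("TYPE", "SELECT"):
        return pred.value == gold.value
    return True


def sample_negatives(candidates, positive_idx, n, strategy="distance", rng=None):
    """Return indices of up to n negative elements for one positive element."""
    others = [i for i in range(len(candidates)) if i != positive_idx]
    if strategy == "random":
        rng = rng or random.Random(0)
        return rng.sample(others, min(n, len(others)))
    # ASSUMPTION: distance = index gap in the flattened list, nearest first.
    others.sort(key=lambda i: (abs(i - positive_idx), i))
    return others[:n]


def build_pairs(candidates, positive_idx, n=3, strategy="distance", rng=None):
    """Build (chosen, rejected) preference pairs at a 1:n negative ratio."""
    negatives = sample_negatives(candidates, positive_idx, n, strategy, rng)
    return [(candidates[positive_idx], candidates[j]) for j in negatives]


if __name__ == "__main__":
    elements = ["e0", "e1", "e2", "e3", "e4", "e5"]
    print(build_pairs(elements, positive_idx=2, n=3))
    print(is_correct(Action("TYPE", "e2", "mac"), Action("TYPE", "e2", "macbook")))
```

Each resulting pair could then be scored with a DPO-style loss, with `n` playing the role of the negative ratio varied in the Figure 3 ablation.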