XPath Agent: An Efficient XPath Programming Agent Based on LLM for Web Crawler

Yu Li, Bryce Wang, Xinyu Luan

TL;DR

XPath Agent introduces a two-stage, LLM-driven approach to XPath generation tailored for web crawling and GUI testing. By extracting target information with cue texts and sanitizing HTML, it condenses content and enables robust XPath programming through a condenser, static XPath generation, and a conversational evaluator. Empirical results on the SWDE dataset show XPath Agent achieves competitive performance with reduced token usage and faster execution compared to state-of-the-art baselines, validating its practical utility in production workflows. The work advances the deployment-readiness of LLM-based web information extraction by focusing on efficiency, generalization, and seamless workflow integration, with code available at the project repository.

Abstract

We present XPath Agent, a production-ready XPath programming agent specifically designed for web crawling and web GUI testing. A key feature of XPath Agent is its ability to automatically generate XPath queries from a set of sampled web pages using a single natural language query. To demonstrate its effectiveness, we benchmark XPath Agent against a state-of-the-art XPath programming agent across a range of web crawling tasks. Our results show that XPath Agent achieves comparable performance metrics while significantly reducing token usage and improving clock-time efficiency. The well-designed two-stage pipeline allows for seamless integration into existing web crawling or web GUI testing workflows, thereby saving time and effort in manual XPath query development. The source code for XPath Agent is available at https://github.com/eavae/feilian.

Paper Structure

This paper contains 26 sections, 2 figures, 1 table, 2 algorithms.

Figures (2)

  • Figure 1: The two-stage pipeline of XPath Agent. The first stage, Information Extraction, extracts the target information and cue text from sanitized web pages (red marks the sanitized parts). The second stage, XPath Programming, generates XPath queries based on the condensed HTML (green marks the target nodes) and the generated XPath.
  • Figure 2: Token statistics analysis with Algorithm 1. As page size grows, the size after sanitization increases slowly (128 pages sampled per category from the SWDE dataset, around 10k pages in total).
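
To make the sanitization idea behind Figure 2 concrete, here is a minimal sketch using only the Python standard library. It is not the paper's Algorithm 1, which is not reproduced on this page. The lists of dropped tags and kept attributes and the names `Sanitizer` and `sanitize` are assumptions chosen for illustration. The sketch removes scripts, styles, comments, and most attributes, then runs a simple XPath-style query against the smaller document.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: subtrees that carry no extractable information.
DROP_TAGS = {"script", "style", "noscript", "svg"}
# Assumption: attributes worth keeping because XPath queries often rely on them.
KEEP_ATTRS = {"id", "class"}
# Void elements have no closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Sanitizer(HTMLParser):
    """Hypothetical sanitizer: re-emits a page without noisy markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace; whitespace at text-node edges is dropped.
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.out.append(escape(text, quote=False))

    # Comments are dropped: the default handle_comment does nothing.


def sanitize(html: str) -> str:
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = """<html><head><style>p{color:red}</style><script>var x = 1;</script></head>
<body><div id="main" style="margin:0" data-track="a1">
  <!-- ad slot -->
  <h1 class="title" onclick="go()">Blue Widget</h1>
  <span class="price">$19.99</span>
</div></body></html>"""

clean = sanitize(page)
print(len(page), "->", len(clean), "characters")

# ElementTree supports only a subset of XPath and needs well-formed markup,
# which this small example happens to satisfy.
root = ET.fromstring(clean)
price = root.find(".//span[@class='price']").text
print(price)
```

A production crawler would likely evaluate generated queries with a full XPath 1.0 engine and a tolerant HTML parser. The standard library is used here only to keep the sketch self-contained.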