An Index-based Approach for Efficient and Effective Web Content Extraction

Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao

TL;DR

This work tackles the challenge of extracting task-relevant web content for large language models under large page volumes and limited context windows. It introduces an index-based extraction paradigm that partitions HTML into addressable blocks, assigns numeric indices, and trains a dedicated IndexLM to predict content-relevant indices, thereby decoupling latency from the amount of content retrieved. The approach achieves higher accuracy and faster extraction than existing methods, both when used as a post-retrieval filter in RAG QA systems and in direct extraction tasks, and is supported by specialized datasets and training procedures. The results indicate substantial practical benefits for web-enabled AI agents and point to future extensions including reinforcement learning optimization and broader domain applications.

Abstract

As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management, under large token budgets and low signal density, emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction, which reframes extraction from slow, token-by-token generation into an efficient, discriminative index-prediction task, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments and extract only the positional indices of the content relevant to a given query. This decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within a RAG QA system and find that it improves QA accuracy. We then directly measure its match rate against the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing methods in both accuracy and speed, effectively bridging the gap between LLMs and the vast volume of web pages.
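
The pipeline described above (partition HTML into addressable blocks, predict the indices of relevant blocks, map the indices back to text) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the set of block-delimiting tags, the helper names (split_into_blocks, render_indexed, toy_predict_indices, extract_by_indices), and the "[i] text" prompt layout are all assumptions made for illustration, and the keyword-overlap predictor is only a stand-in for the trained IndexLM.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags delimit addressable blocks. The paper's actual
# structure-aware segmentation rules are not given here and may differ.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "td", "div"}


class BlockSplitter(HTMLParser):
    """Collect the text between block-level tag boundaries as one block each."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def flush(self):
        # Normalize whitespace and keep only non-empty blocks.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def split_into_blocks(html):
    """Partition an HTML string into an ordered list of text blocks."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text after the last block tag
    return parser.blocks


def render_indexed(blocks):
    """Hypothetical model input: one line per block, prefixed by its index."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(blocks))


def toy_predict_indices(blocks, query):
    """Stand-in for IndexLM: select blocks sharing a word with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    return [
        i for i, text in enumerate(blocks)
        if query_words & set(re.findall(r"\w+", text.lower()))
    ]


def extract_by_indices(blocks, indices):
    """Map predicted indices back to the original blocks, in page order."""
    valid = sorted({i for i in indices if 0 <= i < len(blocks)})
    return [blocks[i] for i in valid]
```

In this sketch the predictor returns only a short list of integers, so the size of its output does not grow with the length of the selected blocks, which is the sense in which index prediction decouples extraction latency from content length.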

Paper Structure

This paper contains 38 sections, 3 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Index-based extraction offers faster speed compared to token-by-token generative extraction.
  • Figure 2: Comparison of index-based web content extraction and previous works. Chunk-and-rerank RAG methods are unable to perform main content extraction, while methods based on heuristic rules have difficulty with query-relevant extraction. In comparison to other methods, our approach is both effective and efficient.
  • Figure 3: The complete process of Index-based Web Content Extraction.
  • Figure 4: An example of Index-based Web Content Extraction. Text with a green background is query-relevant content; text with a red background is not. Each block of the original content is mapped to an index, and the query-relevant indices are finally mapped back to the original text blocks. A more detailed real example is shown in the appendix.
  • Figure 5: Our method consistently outperforms previous works across all context length limits. The curve's stability between 0.5K and 4K suggests that the query-relevant information for most queries is under 512 tokens, and that our approach is able to extract it precisely.
  • ...and 4 more figures