
Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM

Mengjie Liu, Jiahui Peng, Pei Chu, Jiantao Qiu, Ren Ma, He Zhu, Rui Min, Lindong Lu, Wenchang Ning, Linfeng Hou, Kaiwen Liu, Yuan Qu, Zhenxiang Li, Chao Xu, Zhongying Tu, Wentao Zhang, Conghui He

TL;DR

Dripper tackles the challenge of token-efficient main HTML extraction by recasting the problem as semantic block classification on a simplified HTML representation processed by a small decoder-only LM. It introduces four innovations: HTML simplification to reduce input length, block-wise sequence labeling to lower inference cost, a constrained decoding mechanism with a logits processor to prevent hallucinations, and WebMainBench, a large, richly annotated benchmark. Empirical results show that Dripper-0.6B achieves state-of-the-art ROUGE-N F1 scores on WebMainBench and generalizes well across established benchmarks, with a beneficial fallback to traditional tools for coverage. The work demonstrates that token-efficient, structured prediction with a small LM can surpass larger generative approaches for web content extraction and provides a public benchmark and codebase to advance offline data pipelines for AI training data.

Abstract

Accurately and efficiently extracting main content from general web pages is of great significance for obtaining training data for large models. Using well-pre-trained decoder-only generative language models offers excellent document comprehension capabilities, thereby effectively enhancing parsing quality. However, this approach remains constrained by issues such as context window length, inference cost, and format hallucination. We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models, which addresses these challenges through four key innovations: (1) We design a specialized HTML simplification algorithm that reduces the input token count to 22% of the raw HTML while preserving critical structural information; (2) We reformulate main content extraction as a semantic block sequence classification task, significantly reducing inference cost; (3) We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors, effectively eliminating hallucination issues common in small-scale models; (4) We propose WebMainBench, an evaluation dataset containing over 7,800 web pages with meticulously human-annotated main content extraction labels. Experimental results demonstrate that using only a 0.6B parameter model, Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods, attaining a ROUGE-N F1 score of 81.58% (83.13% with the fallback strategy) on our proposed WebMainBench dataset.
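
As a rough illustration of innovation (3), the sketch below shows the general idea behind constraining the output space at decoding time: logits of every token outside an allowed set are masked to negative infinity, so the decoder can only emit a valid label. The toy vocabulary, the label names `main` and `other`, and the helper names are assumptions made for this example and are not taken from the Dripper codebase.

```python
import math

def constrain_logits(logits, allowed_ids):
    """Return a copy of logits where every token outside allowed_ids is -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Index of the highest-scoring token (greedy decoding for one step)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical toy vocabulary: ids 0 and 1 are the two block labels,
# everything else stands for free-form text the model might hallucinate.
VOCAB = ["main", "other", "<html>", "hello", "</s>"]
LABEL_IDS = [0, 1]

# Made-up scores for one decoding step; unconstrained, "<html>" would win.
raw_logits = [1.2, 0.7, 3.5, 2.9, 0.1]

unconstrained_choice = VOCAB[greedy_pick(raw_logits)]
masked_logits = constrain_logits(raw_logits, LABEL_IDS)
constrained_choice = VOCAB[greedy_pick(masked_logits)]

print(unconstrained_choice, constrained_choice)
```

With the mask applied, the decoding step can only select one of the label tokens, which is why this style of constraint rules out malformed output by construction rather than by post-hoc filtering.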

Paper Structure

This paper contains 26 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An overview of the Dripper framework, which operates as a three-stage pipeline. (1) Pre-processing: A raw HTML document is converted into two parallel representations: Simplified HTML for model input and Mapping HTML for final reconstruction. (2) Dripper-0.6B Extraction: Dripper-0.6B performs sequential block classification on the simplified input, guided by a custom logits processor to output a structured sequence. (3) Post-processing: The labels are used to select the corresponding blocks from Mapping HTML to construct the final, clean Main Content.
  • Figure 2: Impact of the logits processor on performance across various training data scales.
  • Figure 3: Screenshot of the web page annotation tool. The main content selection is highlighted in blue on the left, with a real-time preview on the right.
  • Figure 4: An example data entry from WebMainBench. It includes the raw source, the ground-truth main HTML, its Markdown conversion, and a rich set of metadata for fine-grained analysis.
  • Figure 5: Prompt template for Main HTML classification.
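
The three-stage pipeline described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: tag stripping stands in for the real HTML simplification algorithm, a word-count heuristic (`classify_stub`) stands in for Dripper-0.6B, and all function names and label strings are hypothetical.

```python
import re

def preprocess(raw_blocks):
    """Stage 1: build a simplified view (model input) and a mapping view (reconstruction)."""
    simplified, mapping = [], {}
    for block_id, html in enumerate(raw_blocks, start=1):
        mapping[block_id] = html                   # original block, kept untouched
        text = re.sub(r"<[^>]+>", " ", html)       # crude stand-in: drop tags and attributes
        text = " ".join(text.split())              # collapse whitespace
        simplified.append((block_id, text))
    return simplified, mapping

def classify_stub(simplified):
    """Stage 2 stand-in: label a block 'main' if it has at least 5 words.

    In the paper this step is performed by the language model; the heuristic
    here only makes the example runnable.
    """
    return {bid: ("main" if len(text.split()) >= 5 else "other")
            for bid, text in simplified}

def postprocess(labels, mapping):
    """Stage 3: select the original blocks whose label is 'main'."""
    return "\n".join(mapping[bid] for bid in sorted(mapping) if labels[bid] == "main")

raw_blocks = [
    '<nav class="top"><a href="/">Home</a> <a href="/about">About</a></nav>',
    '<p class="lead">Dripper extracts the main content of a web page.</p>',
    '<div id="footer">Copyright 2025</div>',
]

simplified, mapping = preprocess(raw_blocks)
labels = classify_stub(simplified)
main_html = postprocess(labels, mapping)
print(main_html)
```

The point of keeping two parallel views is that the classifier only ever sees the short simplified input, while the final output is assembled from the untouched original blocks, so nothing in the extracted content is generated by the model.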