Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
Mengjie Liu, Jiahui Peng, Pei Chu, Jiantao Qiu, Ren Ma, He Zhu, Rui Min, Lindong Lu, Wenchang Ning, Linfeng Hou, Kaiwen Liu, Yuan Qu, Zhenxiang Li, Chao Xu, Zhongying Tu, Wentao Zhang, Conghui He
TL;DR
Dripper addresses token-efficient main HTML extraction by recasting the problem as semantic block classification over a simplified HTML representation, processed by a small decoder-only LM. It introduces four innovations: HTML simplification to reduce input length, block-wise sequence labeling to lower inference cost, a constrained decoding mechanism that uses a logits processor to prevent hallucinations, and WebMainBench, a large, richly annotated benchmark. Empirical results show that Dripper-0.6B achieves state-of-the-art ROUGE-N F1 scores on WebMainBench and generalizes well across established benchmarks, and that falling back to traditional tools improves coverage. The work demonstrates that token-efficient, structured prediction with a small LM can surpass larger generative approaches to web content extraction, and it provides a public benchmark and codebase to advance offline data pipelines for AI training data.
Abstract
Accurately and efficiently extracting main content from general web pages is important for obtaining training data for large models. Well-pre-trained decoder-only generative language models offer strong document comprehension and can therefore improve parsing quality. However, this approach remains constrained by context window length, inference cost, and format hallucination. We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models, which addresses these challenges through four key innovations: (1) We design a specialized HTML simplification algorithm that reduces the input token count to 22\% of the raw HTML while preserving critical structural information; (2) We reformulate main content extraction as a semantic block sequence classification task, significantly reducing inference cost; (3) We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors, effectively eliminating the hallucination issues common in small-scale models; (4) We propose WebMainBench, an evaluation dataset containing over 7,800 web pages with meticulously human-annotated main content extraction labels. Experimental results demonstrate that, using only a 0.6B-parameter model, Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods, attaining a ROUGE-N F1 score of 81.58\% (83.13\% with the fallback strategy) on our proposed WebMainBench dataset.
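To make innovations (2) and (3) concrete, the following is a minimal sketch of block classification with a constrained output space; it is not Dripper's implementation. It assumes a hypothetical two-label scheme (`main` / `other`), a toy vocabulary, and hand-written per-block logits standing in for model outputs. The helper names `mask_logits`, `constrained_label`, and `extract_main` are illustrative only.

```python
import math

# Hypothetical label set: each HTML block is tagged "main" or "other".
ALLOWED_LABELS = ("main", "other")


def mask_logits(logits, allowed_ids):
    """Set every logit outside allowed_ids to -inf so it can never be selected."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def constrained_label(logits, vocab):
    """Return the best-scoring label token after masking all non-label tokens."""
    allowed_ids = [vocab[label] for label in ALLOWED_LABELS]
    masked = mask_logits(logits, allowed_ids)
    best_id = max(range(len(masked)), key=masked.__getitem__)
    id_to_token = {i: tok for tok, i in vocab.items()}
    return id_to_token[best_id]


def extract_main(blocks, per_block_logits, vocab):
    """Keep only the blocks whose constrained label is "main"."""
    labels = [constrained_label(logits, vocab) for logits in per_block_logits]
    return [block for block, label in zip(blocks, labels) if label == "main"]


# Toy vocabulary (an assumption): two label tokens plus two unrelated tokens.
vocab = {"main": 0, "other": 1, "<div>": 2, "hello": 3}
blocks = ["<nav>Home</nav>", "<p>Article text</p>"]
# Stand-in logits: in both rows the unconstrained argmax is a non-label token.
per_block_logits = [
    [0.1, 0.9, 5.0, 2.0],
    [1.2, 0.3, 0.0, 7.0],
]
main_blocks = extract_main(blocks, per_block_logits, vocab)
```

Because the mask removes every non-label token before selection, the output is always a valid label, even when the raw scores favor an unrelated token. The model also emits only one short label per block rather than regenerating the page text, which is the intuition behind the reduced inference cost.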
