Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

Jeffrey Li, Josh Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pour Ansari, Fartash Faghri

TL;DR

This work shows that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. It suggests a simple intervention: taking a Union over different extractors can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance.

Abstract

One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.
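The Union intervention described in the abstract can be illustrated with a minimal sketch. It assumes the filtering pipeline has already been run separately on each extractor's output, and that a page kept under several extractors is resolved by a fixed priority order. The function name `union_over_extractors`, the input layout, and the priority rule are all hypothetical choices made for illustration; the abstract does not specify how overlapping pages are resolved.

```python
from typing import Dict, List


def union_over_extractors(
    survivors: Dict[str, Dict[str, str]],
    priority: List[str],
) -> Dict[str, Dict[str, str]]:
    """Merge the pages that survived filtering under each extractor.

    survivors: extractor name -> {page URL: extracted text}, containing only
        pages that passed the (fixed) filtering pipeline for that extractor.
    priority: extractor names, most preferred first. This tie-breaking rule
        is an assumption of this sketch, not something stated in the source.
    """
    merged: Dict[str, Dict[str, str]] = {}
    for name in priority:
        for url, text in survivors.get(name, {}).items():
            # Keep only the first (highest-priority) extraction per page so
            # that no page appears twice in the final dataset.
            merged.setdefault(url, {"extractor": name, "text": text})
    return merged


if __name__ == "__main__":
    # Toy data: page "b" survives under two extractors, the rest under one.
    toy = {
        "resiliparse": {"a": "text A (resiliparse)", "b": "text B (resiliparse)"},
        "trafilatura": {"b": "text B (trafilatura)", "c": "text C (trafilatura)"},
        "justext": {"d": "text D (justext)"},
    }
    result = union_over_extractors(toy, ["resiliparse", "trafilatura", "justext"])
    for url in sorted(result):
        print(url, result[url]["extractor"])
```

Because each extractor contributes pages the others lose during filtering, the merged set is larger than any single extractor's set, which is the source of the higher token yield.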

Paper Structure (33 sections, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Different extractors lead to different final pages. The Venn diagram shows the overlap in resulting pages that come from applying the DCLM-Baseline pipeline to the outputs of three different initial extractors. 61% of pages are uniquely kept for just one extractor.
  • Figure 2: (Left) Analysis of extractor imbalance across domains. We group pages from all three extractors (Top 11%) by domain. For every domain with at least 50 pages, we compute the maximum ratio represented by any one extractor, plotting the distribution above. For 26.7% and 7.6% of domains respectively, at least 60% and 80% of surviving pages come from just one extractor. (Right) Does higher yield lead to better performance when data-constrained? We train 1B-5x models on smaller subsamples of data curated from the 1B-1x raw pool. The Union datasets that yield more tokens are able to achieve better performance. See the table labeled tab:union_data_constrained for extended results.
  • Figure 3: Extractor performance remains consistent across serialization formats. We plot the WikiTQ performance stratified across different test-time table serializations. Despite resiliparse and trafilatura producing tables that most closely match "concat" and "markdown" formats respectively, we observe a surprising degree of generalization across most serializations (with the exception of "json").
  • Figure 4: Extraction comparison for mutual fund data. We use difflib to visualize pairwise comparisons between resiliparse (left) and jusText (top) or trafilatura (bottom). For both tables in this page, jusText removes them while trafilatura applies markdown formatting.
  • Figure 5: Extraction comparison for a BioMed Central article. We use difflib to visualize pairwise comparisons between resiliparse (left) and jusText (top) or trafilatura (bottom). Note that for the table shown, jusText removes it while trafilatura applies markdown formatting. Here, resiliparse ends up splitting entries across line breaks instead of single spaces.
  • ...and 5 more figures
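
The per-domain imbalance analysis described for Figure 2 (Left) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, a "domain" is assumed to be the URL's network location (the paper may group by registered domain instead), and a page kept under several extractors is assumed to be counted once per extractor.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple
from urllib.parse import urlparse


def max_extractor_share_by_domain(
    pages: Iterable[Tuple[str, str]],
    min_pages: int = 50,
) -> Dict[str, float]:
    """Return, per domain, the largest share held by any single extractor.

    pages: (url, extractor name) pairs for pages that survived filtering.
    min_pages: domains with fewer surviving pages are skipped (the caption
        uses a threshold of 50).
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for url, extractor in pages:
        # Assumption: the network location stands in for the domain.
        domain = urlparse(url).netloc.lower()
        counts[domain][extractor] += 1

    shares: Dict[str, float] = {}
    for domain, per_extractor in counts.items():
        total = sum(per_extractor.values())
        if total >= min_pages:
            shares[domain] = max(per_extractor.values()) / total
    return shares


def fraction_at_least(shares: Dict[str, float], threshold: float) -> float:
    """Fraction of domains whose dominant extractor holds >= threshold."""
    if not shares:
        return 0.0
    return sum(s >= threshold for s in shares.values()) / len(shares)
```

Under these assumptions, calling `fraction_at_least(shares, 0.6)` and `fraction_at_least(shares, 0.8)` would give the kind of statistic reported in the caption (26.7% and 7.6% of domains respectively).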