AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Runyuan Ma, Chenlin Su, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He

TL;DR

This work argues that HTML-to-text extraction quality is a crucial, underexplored factor in large-scale pretraining. It introduces MinerU-HTML, a model-based, semantically aware HTML extractor that converts raw HTML into AI-ready Main-HTML and a structured intermediate representation for Markdown formatting. By scaling through template-aware generalization, MinerU-HTML enables web-scale extraction, yielding AICC, a 7.3T-token multilingual corpus that preserves complex content like formulas, code blocks, and tables. Controlled pretraining experiments show models trained on AICC outperform those trained on heuristic-extracted data, demonstrating that extraction quality can rival aggressive filtering strategies. The authors publicly release WebMainBench, MinerU-HTML, and AICC to promote research on semantic-aware web data curation and its impact on downstream model capabilities.

Abstract

While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, which provides direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
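For readers unfamiliar with the ROUGE-N F1 metric cited in the abstract, the following is a minimal sketch of how such a score can be computed between a reference extraction and a candidate extraction. The whitespace tokenization, the default `n=2`, and the function names `ngrams` and `rouge_n_f1` are assumptions made for illustration; the abstract does not specify the tokenizer or the value of N used on MainWebBench.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=2):
    """F1 over clipped n-gram overlap (whitespace tokenization is an assumption)."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = "the main article text with a code block"
    extracted = "the main article text plus a sidebar link"
    print(round(rouge_n_f1(gold, extracted), 3))
```

An extractor that drops main content loses recall, and one that keeps navigation or sidebar text loses precision, so the F1 penalizes both failure modes.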

Paper Structure

This paper contains 59 sections, 4 equations, 23 figures, 8 tables.

Figures (23)

  • Figure 1: Overview of the MinerU-HTML Core Extraction Pipeline. The pipeline consists of three stages: (1) Pre-processing: Raw HTML is transformed into two parallel representations—Simplified HTML (streamlined input for the model with reduced tokens) and Mapping HTML (preserving original structure for faithful reconstruction). (2) Content Classification: MinerU-HTML-Classifier (0.6B parameter LM) performs sequential block classification on the simplified input, with a custom logits processor implementing constrained decoding to ensure structured JSON output without hallucination. (3) Post-processing: Predicted labels ("main" or "other") are used to select corresponding blocks from the Mapping HTML, yielding the final Main-HTML as a valid DOM subtree of the original document.
  • Figure 2: Iterative improvement pathways for MinerU-HTML. MinerU-HTML follows a virtuous cycle: the model-based extractor can be systematically improved by collecting more training data (including failure cases), retraining on expanded datasets, and leveraging advances in base model capabilities. This makes MinerU-HTML's approach inherently more scalable and future-proof as language models continue to advance.
  • Figure 3: Length ratio distribution between AICC and TfCC documents. Positive values indicate AICC extracts more content.
  • Figure 4: Extraction quality vs. length ratio. Pairwise win rates (AICC vs. TfCC) judged by DeepSeek-Chat-V3 on 10,000 stratified samples. A sharp crossover at ratio = 0 (red dashed line) reveals that when AICC extracts more content (positive ratios), it is preferred in 75–98% of comparisons; when it extracts less (negative ratios), TfCC is preferred in 51–92% of cases.
  • Figure 5: Training dynamics across 13 benchmarks for models pretrained on AICC and TfCC. Average accuracy at 15 checkpoints (4B–63B tokens). AICC consistently maintains superior or competitive performance throughout training.
  • ...and 18 more figures
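The post-processing stage described in the Figure 1 caption, where predicted "main" or "other" labels select blocks from the Mapping HTML, can be illustrated with a small sketch. This is not the authors' implementation: the block IDs, the `(block_id, html)` tuple layout, the JSON shape `{"id": "label"}`, and the function name `select_main_blocks` are all hypothetical and chosen only to show the idea.

```python
import json


def select_main_blocks(mapping_blocks, model_output):
    """Keep the blocks whose predicted label is "main".

    mapping_blocks: list of (block_id, html) pairs in document order
                    (hypothetical stand-in for the Mapping HTML).
    model_output:   JSON string mapping block_id -> "main" or "other"
                    (hypothetical stand-in for the classifier output).
    """
    labels = json.loads(model_output)
    # Blocks with a missing label are treated as "other" and dropped.
    return [html for block_id, html in mapping_blocks
            if labels.get(block_id) == "main"]


if __name__ == "__main__":
    blocks = [
        ("1", "<nav>Home | About</nav>"),
        ("2", "<p>Body text of the article.</p>"),
        ("3", "<pre>print(1)</pre>"),
        ("4", "<footer>Footer links</footer>"),
    ]
    output = '{"1": "other", "2": "main", "3": "main", "4": "other"}'
    print("\n".join(select_main_blocks(blocks, output)))
```

Because the selected blocks come from the original document rather than from generated text, the result stays a subset of the source HTML, which matches the caption's point that the model only emits labels and the content itself is never rewritten.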