Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini

TL;DR

This work introduces WebOrganizer, a two-dimensional framework that partitions web-scale pre-training data into topic and format domains to illuminate corpus composition and enable principled data curation. By distilling a large LLM into compact topic and format classifiers, the authors annotate a 200B-token CommonCrawl-derived corpus and optimize domain mixtures with RegMix to improve performance on downstream tasks such as MMLU and HellaSwag. They show that domain mixing, including combinations of topics and formats, can outperform or complement existing quality-filter methods, and that quality filters themselves induce implicit domain shifts. The study also analyzes the relationship between domain structure and data quality, demonstrating that domain-aware curation provides transparency and practical gains, while acknowledging limitations and areas for future refinement. The authors open-source WebOrganizer and the associated data and annotations to advance transparent, data-centric pre-training for large language models.

Abstract

Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.

Paper Structure

This paper contains 47 sections, 2 equations, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: We construct topic domains (left) and format domains (right) to organize pre-training corpora. The areas visualize the number of tokens per domain in a cleaned pre-training corpus based on CommonCrawl. See \ref{app:domain_descriptions} for detailed definitions of the categories. We provide an interactive explorer of the domains at https://weborganizer.allen.ai.
  • Figure 2: We visualize the 15 highest co-occurrences in the normalized pointwise mutual information (NPMI) matrix between topics (y-axis) and formats (x-axis). \ref{fig:pmi_topic_type_full} shows the full matrix, where most entries are close to zero.
  • Figure 3: The corpus proportions of our topic domains (left) and formats (right), and the training mixtures predicted by RegMix for targeting MMLU, HellaSwag, and both tasks. Numerical values can be found in \ref{tab:mixtures_weights} in the appendix.
  • Figure 4: The implicit domain compositions from quality filtering compared to the corpus distribution for topic domains (left) and format domains (right). We include the RegMix prediction tailored to both MMLU and HellaSwag from \ref{fig:mixtures_regmix} to facilitate comparison. Numerical values can be found in \ref{tab:mixtures_weights} in the appendix.
  • Figure 5: Frequency statistics of URL domain names in our 200B-token CommonCrawl corpus. Left: Plotting log document frequency vs. the log rank of the domain name exhibits Zipfian long-tail behavior. Middle and right: We list the most common domain names (middle) and a random sample of domains with between 100 and 100K documents (right). We plot statistics after removing any sub-domains, e.g., en.wikipedia.org $\to$ wikipedia.org.
  • ...and 3 more figures
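
The NPMI matrix in Figure 2 can be illustrated with a minimal sketch. This is not the authors' code: it assumes the standard definition NPMI(t, f) = PMI(t, f) / (-log p(t, f)) and a hypothetical `counts` array in which `counts[i, j]` is the number of documents (or tokens) labeled with topic `i` and format `j`. The function names `npmi_matrix` and `top_pairs` are illustrative only.

```python
import numpy as np


def npmi_matrix(counts):
    """Return the NPMI for every (topic, format) cell of a co-occurrence table.

    counts[i, j] is assumed to hold how often topic i and format j co-occur.
    """
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()                # p(topic, format)
    p_topic = joint.sum(axis=1, keepdims=True)   # p(topic), shape (T, 1)
    p_format = joint.sum(axis=0, keepdims=True)  # p(format), shape (1, F)

    # Empty cells produce log(0); silence the warnings and patch them below.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (p_topic * p_format))
        npmi = pmi / -np.log(joint)

    # By convention, pairs that never co-occur get the minimum value -1.
    npmi[joint == 0] = -1.0
    return npmi


def top_pairs(npmi, k=15):
    """Return the k highest-NPMI cells as (topic_index, format_index, value)."""
    flat = np.argsort(npmi, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, npmi.shape)
    return [(int(r), int(c), float(npmi[r, c])) for r, c in zip(rows, cols)]
```

Under this definition, values near 0 indicate that a topic and a format occur together about as often as independence would predict, which matches the caption's remark that most entries of the full matrix are close to zero. A table with a single non-empty cell makes the denominator zero and is not handled by this sketch.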