Unearthing Large Scale Domain-Specific Knowledge from Public Corpora

Zhaoye Fei, Yunfan Shao, Linyang Li, Zhiyuan Zeng, Conghui He, Qipeng Guo, Hang Yan, Dahua Lin, Xipeng Qiu

TL;DR

The paper presents Retrieve-from-CC, a pipeline that uses LLM-driven query expansion to automatically generate domain-relevant queries and BM25-based retrieval from Common Crawl to assemble Retrieve-Pile, a large, domain-diverse corpus. Models trained or further pre-trained on Retrieve-Pile (in particular Llama2-QoC and Mistral-QoC) show notable gains on mathematics and knowledge-reasoning benchmarks such as MATH, GSM8K, MMLU, AGIEval, and BBH, with contamination largely controlled. The authors provide extensive analyses of data domain distribution, web-domain composition, and data quality, arguing that automated retrieval can match or exceed manually curated datasets in educational value while reducing costs. They also discuss limitations related to data quality in public corpora and potential hallucination risks, and suggest directions for improving data filtering and retrieval fidelity.

Abstract

Large language models (LLMs) have demonstrated remarkable potential in various tasks; however, there remains a significant lack of open-source models and data for specific domains. Previous work has primarily focused on manually specifying resources and collecting high-quality data for specific domains, which is extremely time-consuming and labor-intensive. To address this limitation, we introduce large models into the data collection pipeline to guide the generation of domain-specific information and retrieve relevant data from Common Crawl (CC), a large public corpus. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines data containing potential reasoning procedures from the public corpus. By applying this method, we have collected a knowledge domain-related dataset named Retrieve-Pile, which covers four main domains, including the sciences, humanities, and other categories. Our analysis of Retrieve-Pile shows that Retrieve-from-CC effectively retrieves relevant data from the covered knowledge domains and significantly improves performance on tests of mathematical and knowledge-related reasoning abilities. We have released Retrieve-Pile at https://huggingface.co/datasets/Query-of-CC/Retrieve-Pile.
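
To make the retrieval step named in the TL;DR more concrete, the following is a minimal sketch of BM25 ranking over a toy corpus, written in Python with only the standard library. It is an illustration under stated assumptions, not the authors' implementation: the class `BM25Index`, the helper `retrieve`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the example documents and queries are all hypothetical. The queries stand in for a seed query plus an LLM-expanded query; the expansion step itself is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline would use a proper analyzer.
    return text.lower().split()


class BM25Index:
    """Toy in-memory BM25 index (hypothetical helper, not from the paper)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # k1 and b are common default values, assumed here for illustration.
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(t) for t in tokens]
        self.doc_len = [len(t) for t in tokens]
        self.avg_len = sum(self.doc_len) / len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter()
        for tf in self.doc_tf:
            df.update(tf.keys())
        n = len(docs)
        # Smoothed IDF that stays non-negative for very frequent terms.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf = self.doc_tf[i]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query, k=2):
        # Rank all documents by score and keep at most k with a positive score.
        scores = [self.score(query, i) for i in range(len(self.doc_tf))]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [i for i in ranked[:k] if scores[i] > 0]


def retrieve(index, queries, k=1):
    # Union of the top-k hits over all queries, de-duplicated in first-seen order.
    seen = []
    for q in queries:
        for i in index.top_k(q, k):
            if i not in seen:
                seen.append(i)
    return seen


docs = [
    "the quadratic formula solves any quadratic equation step by step",
    "best pizza recipes for a quick weeknight dinner",
    "photosynthesis converts light energy into chemical energy in plants",
]
index = BM25Index(docs)
# Hypothetical seed query followed by a query an LLM might have generated from it.
queries = ["quadratic equation", "solve quadratic equation step by step"]
hits = retrieve(index, queries, k=1)
print([docs[i] for i in hits])
```

At Common Crawl scale, an exhaustive scoring loop like this would be replaced by an inverted-index search engine; the sketch only shows how term frequency, inverse document frequency, and length normalization combine into a ranking.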

Paper Structure (42 sections, 9 figures, 6 tables)

This paper contains 42 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison of traditional manual data collection methods with our approach.
  • Figure 2: The overview of Retrieve-from-CC's two major components: Query Expanding and Data Retrieval.
  • Figure 3: The category distribution of the queries for $\mathsf{Retrieve}$-$\mathsf{Pile}$.
  • Figure 4: Left: The frequency distribution of document counts across URL domains; most domains have few documents, while a small number have many. The y-axis uses a logarithmic scale to highlight this imbalance. This means Retrieve-from-CC not only retrieves data from high-knowledge-density websites like Wikipedia but also collects data from scattered websites. Right: The timestamp statistics of $\mathsf{Retrieve}$-$\mathsf{Pile}$; most of its data comes from recent years (different colors represent different years).
  • Figure 5: The distribution of QuRating (Wettig et al., 2024) scores for $\mathsf{Retrieve}$-$\mathsf{Pile}$, The Pile, and selected high-quality subsets of The Pile. QuRating is a robust metric designed to evaluate data quality across four dimensions, with higher scores indicating better quality. Following Wettig et al. (2024), the scores are normalized to have a mean of zero and a standard deviation of one for all displayed data.
  • ...and 4 more figures