Unearthing Large Scale Domain-Specific Knowledge from Public Corpora
Zhaoye Fei, Yunfan Shao, Linyang Li, Zhiyuan Zeng, Conghui He, Qipeng Guo, Hang Yan, Dahua Lin, Xipeng Qiu
TL;DR
The paper presents Retrieve-from-CC, a pipeline that uses LLM-driven query expansion to automatically generate domain-relevant queries, then applies BM25-based retrieval over Common Crawl to assemble Retrieve-Pile, a large, domain-diverse corpus. Models trained or further pre-trained on Retrieve-Pile (in particular Llama2-QoC and Mistral-QoC) show gains on mathematics and knowledge-reasoning benchmarks such as MATH, GSM8K, MMLU, AGIEval, and BBH, with contamination largely controlled. The authors provide extensive analyses of data domain distribution, web-domain composition, and data quality, arguing that automated retrieval can match or exceed manually curated datasets in educational value while reducing costs. They also discuss limitations related to data quality in public corpora and potential hallucination risks, suggesting directions for improving data filtering and retrieval fidelity.
Abstract
Large language models (LLMs) have demonstrated remarkable potential in various tasks; however, there remains a significant lack of open-source models and data for specific domains. Previous work has primarily focused on manually specifying resources and collecting high-quality data for specific domains, which is extremely time-consuming and labor-intensive. To address this limitation, we introduce large models into the data collection pipeline to guide the generation of domain-specific information and retrieve relevant data from Common Crawl (CC), a large public corpus. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines data containing potential reasoning procedures from the public corpus. By applying this method, we have collected a knowledge domain-related dataset named Retrieve-Pile, which covers four main domains, including the sciences, humanities, and other categories. Our analysis of Retrieve-Pile shows that Retrieve-from-CC can effectively retrieve relevant data from the covered knowledge domains and significantly improve performance in tests of mathematical and knowledge-related reasoning abilities. We have released Retrieve-Pile at https://huggingface.co/datasets/Query-of-CC/Retrieve-Pile.
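
To make the two stages concrete, the following is a minimal Python sketch of query expansion followed by BM25 retrieval. It is an illustration under stated assumptions, not the authors' implementation: `expand_queries` is a hypothetical stand-in that returns canned queries where the real pipeline would call an LLM, the four-document corpus stands in for Common Crawl, and the whitespace tokenizer and the parameters `k1=1.5`, `b=0.75` are common defaults rather than values taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen for brevity.
    return text.lower().split()


def expand_queries(seed):
    # Hypothetical stand-in for the LLM query-expansion step.
    # A real pipeline would prompt a language model with the seed topic.
    canned = {
        "mathematics": [
            "how to solve a quadratic equation",
            "proof of the pythagorean theorem",
        ]
    }
    return canned.get(seed, [seed])


class BM25:
    """Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        # Smoothed IDF that stays positive for every term.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [i for _, i in scored[:k]]


def retrieve(seed, docs, k=1):
    # Union of the top-k hits over all expanded queries, as sorted indices.
    index = BM25(docs)
    hits = set()
    for query in expand_queries(seed):
        hits.update(index.top_k(query, k))
    return sorted(hits)


DOCS = [
    "To solve a quadratic equation use the quadratic formula",
    "Celebrity gossip and fashion news of the week",
    "A proof of the Pythagorean theorem using similar triangles",
    "Best pizza recipes for a quick dinner",
]

if __name__ == "__main__":
    print(retrieve("mathematics", DOCS, k=1))
```

With `k=1`, each expanded query contributes its single best-matching document, so the seed topic "mathematics" selects the two mathematics pages and skips the unrelated ones. At Common Crawl scale the in-memory index would be replaced by a dedicated search engine; the scoring logic stays the same.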
