Open Artificial Knowledge
Vadim Borisov, Richard H. Schreiber
TL;DR
The Open Artificial Knowledge (OAK) project tackles data scarcity and privacy challenges in training large language models by introducing a large-scale, open, synthetic dataset. It combines topic extraction from Wikipedia with subtopic expansion (via GPT-4o) and dual prompt-generation strategies (programming and meta-prompt engineering) to generate prompts that drive text creation from open-source LLMs. The pipeline yields hundreds of thousands of subtopics and a multi-model, multi-domain corpus exceeding 500 million tokens, with toxicity filtering, privacy safeguards, and community-driven evaluation plans. By releasing OAK publicly and outlining ongoing updates and ethical considerations, the work aims to facilitate model alignment, fine-tuning, and benchmarking while promoting reproducibility and responsible AI research.
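The programmatic side of the prompt-generation step can be illustrated with a minimal sketch. It assumes that topics and their expanded subtopics are already available as a plain dictionary and that prompts are produced by filling fixed templates. The templates, the `build_prompts` helper, and the example topics are hypothetical and are not taken from the OAK pipeline itself.

```python
from itertools import product

# Hypothetical templates for illustration; the source does not list the actual prompts.
TEMPLATES = [
    "Write a short encyclopedic article about {subtopic} in the context of {topic}.",
    "Explain {subtopic} ({topic}) to a beginner in three paragraphs.",
]


def build_prompts(subtopics_by_topic, templates=TEMPLATES):
    """Return one prompt record per (topic, subtopic, template) combination."""
    prompts = []
    for topic, subtopics in subtopics_by_topic.items():
        for subtopic, template in product(subtopics, templates):
            prompts.append(
                {
                    "topic": topic,
                    "subtopic": subtopic,
                    "prompt": template.format(topic=topic, subtopic=subtopic),
                }
            )
    return prompts


# Example input: top-level categories mapped to (assumed) expanded subtopics.
example = {
    "Mathematics": ["Number theory", "Topology"],
    "History": ["Ancient Rome"],
}
prompts = build_prompts(example)
print(len(prompts))  # 3 subtopics x 2 templates = 6
```

Each record keeps its topic and subtopic alongside the prompt text, so generated documents could later be traced back to the category that produced them.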
Abstract
The tremendous success of chat-based AI systems like ChatGPT, Claude, and Gemini stems from Large Language Models (LLMs) trained on vast amounts of data. However, acquiring high-quality, diverse, and ethically sourced training data remains a significant challenge. We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens (at the time of writing) designed to address this issue. OAK leverages an ensemble of state-of-the-art LLMs, including GPT-4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B, to generate high-quality text across diverse domains, guided by Wikipedia's main categories. Our methodology ensures broad knowledge coverage while maintaining coherence and factual accuracy. The OAK dataset aims to foster the development of more capable and aligned language models while addressing critical issues of data scarcity and privacy in LLM training, and it is freely available at www.oakdataset.org.
