Open Artificial Knowledge

Vadim Borisov, Richard H. Schreiber

TL;DR

The Open Artificial Knowledge (OAK) project tackles data scarcity and privacy challenges in training large language models by introducing a large-scale, open, synthetic dataset. It combines topic extraction from Wikipedia with subtopic expansion (via GPT-4o) and dual prompt-generation strategies (programming and meta-prompt engineering) to generate prompts that drive text creation from open-source LLMs. The pipeline yields hundreds of thousands of subtopics and a multi-model, multi-domain corpus exceeding 500 million tokens, with toxicity filtering, privacy safeguards, and community-driven evaluation plans. By releasing OAK publicly and outlining ongoing updates and ethical considerations, the work aims to facilitate model alignment, fine-tuning, and benchmarking while promoting reproducibility and responsible AI research.

Abstract

The tremendous success of chat-based AI systems like ChatGPT, Claude, and Gemini stems from Large Language Models (LLMs) trained on vast amounts of data. However, acquiring high-quality, diverse, and ethically sourced training data remains a significant challenge. We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens (at the time of writing) designed to address this issue. OAK leverages an ensemble of state-of-the-art LLMs, including GPT-4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B, to generate high-quality text across diverse domains, guided by Wikipedia's main categories. Our methodology ensures broad knowledge coverage while maintaining coherence and factual accuracy. The OAK dataset aims to foster the development of more capable and aligned language models while addressing critical issues of data scarcity and privacy in LLM training, and it is freely available at www.oakdataset.org.

Paper Structure (12 sections, 9 figures, 2 tables)

This paper contains 12 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of the Open Artificial Knowledge (OAK) dataset generation pipeline. The process begins by extracting high-level topics from extensive human knowledge sources such as Wikipedia, with subtopics expanded using GPT-4o. These high-level and sub-level topics are then used in an automatic prompt generation step, which employs two methods: meta prompt engineering using large language models (LLMs) and cost-effective programming prompt engineering. The generated prompts are subsequently fed into state-of-the-art open-source LLMs (at the time of writing, five models were used: Llama3-8B, Llama3-70B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B) to create the OAK dataset.
  • Figure 2: Pseudocode for the dynamic prompt engineering algorithm using code: This algorithm generates diverse and contextually rich prompts for the OAK dataset by leveraging Wikipedia topics, randomized analysis types, and varied response lengths. It combines elements of randomization, topic selection, and template-based prompt construction to create a wide range of prompts suitable for synthetic data generation.
  • Figure 3: Subtopic Expansion Prompt: Prompt used to generate detailed subtopics from high-level topics.
  • Figure 4: Meta Prompt: This prompt guides the creation of detailed and high-quality responses tailored to specific topics, following predefined criteria for quality, length, and style.
  • Figure 5: A random sample from the OAK dataset, generated using the Llama3-8B model. The ellipsis (...) denotes text that has been omitted for brevity.
  • ...and 4 more figures
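
The programming prompt engineering step described in the Figure 2 caption (random topic selection, randomized analysis types, varied response lengths, and template-based construction) might look roughly like the sketch below. This is not the authors' pseudocode, which is not reproduced on this page. The topic list, analysis types, length values, templates, and function names are all hypothetical placeholders chosen for illustration.

```python
import random
from typing import List

# Hypothetical placeholders: the actual Wikipedia topics, analysis types,
# response lengths, and templates used for OAK are not given in this text.
TOPICS = ["Photosynthesis", "Roman Empire", "Graph theory"]
ANALYSIS_TYPES = ["summary", "comparison", "step-by-step explanation"]
TARGET_WORDS = [100, 300, 800]
TEMPLATES = [
    "Write a {analysis} of the topic '{topic}' in about {words} words.",
    "Provide a {analysis} covering '{topic}'. Target length: roughly {words} words.",
]


def generate_prompt(rng: random.Random) -> str:
    """Build one prompt by randomly combining a topic, an analysis type,
    a target length, and a template."""
    template = rng.choice(TEMPLATES)
    return template.format(
        analysis=rng.choice(ANALYSIS_TYPES),
        topic=rng.choice(TOPICS),
        words=rng.choice(TARGET_WORDS),
    )


def generate_prompts(n: int, seed: int = 0) -> List[str]:
    """Generate n prompts; a fixed seed makes the batch reproducible."""
    rng = random.Random(seed)
    return [generate_prompt(rng) for _ in range(n)]


if __name__ == "__main__":
    for prompt in generate_prompts(3):
        print(prompt)
```

Because each prompt is assembled from independent random choices, the number of distinct prompts grows multiplicatively with the size of each list, which is presumably why the caption calls this approach cost-effective: no LLM call is needed to produce the prompts themselves.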