Intellecta Cognitiva: A Comprehensive Dataset for Advancing Academic Knowledge and Machine Reasoning
Ajmal PS, Ditto PS, Jithin VG
TL;DR
Intellecta addresses the scarcity of reasoning-focused datasets with a large-scale resource of 11.53B tokens: 8.01B synthetic and 3.52B from textbooks. The authors use Mixtral-8x7B-Instruct-v0.1 to generate both advanced thought processes and textbook-style explanations, and curate the data through OCR ingestion, deduplication via Simhash, toxicity screening with the Perspective API, and topic clustering via DBSCAN. A 634M-parameter boomer model trained on 11.5B tokens achieves competitive cross-domain performance on benchmarks such as ARC and HellaSwag, which supports the claim that a comparatively compact model can generalize well when trained on this data. The work contributes a scalable, ethically curated dataset with broad educational coverage, along with the prompts and methods needed to reproduce the thought-process and textbook data generation. It also points to the potential for more efficient reasoning-capable language models in real-world educational and cognitive tasks.
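To make the Simhash deduplication step mentioned above concrete, here is a minimal Python sketch of near-duplicate filtering. It is an illustration under stated assumptions, not the authors' pipeline: the summary does not specify the shingle size, hash function, fingerprint width, or distance threshold, so the word 3-grams, MD5-derived 64-bit hashes, and Hamming threshold of 3 are hypothetical choices, as are the function names `simhash`, `hamming_distance`, and `deduplicate`.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64, shingle_size: int = 3) -> int:
    """Return a Simhash fingerprint built from word n-gram shingles.

    The shingle size and 64-bit width are assumptions for illustration.
    """
    words = re.findall(r"\w+", text.lower())
    n_shingles = max(1, len(words) - shingle_size + 1)
    shingles = [" ".join(words[i:i + shingle_size]) for i in range(n_shingles)]

    # Each shingle votes +1 or -1 on every bit position of the fingerprint.
    votes = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1

    # A bit is set in the fingerprint when its votes are net positive.
    fingerprint = 0
    for b in range(bits):
        if votes[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(docs: list[str], max_distance: int = 3) -> list[str]:
    """Keep a document only if no kept document is within max_distance bits.

    The threshold of 3 is an assumed value, not one taken from the paper.
    """
    kept: list[str] = []
    fingerprints: list[int] = []
    for doc in docs:
        fp = simhash(doc)
        if all(hamming_distance(fp, seen) > max_distance for seen in fingerprints):
            kept.append(doc)
            fingerprints.append(fp)
    return kept
```

This pairwise comparison is quadratic in the number of kept documents, which is fine for a demonstration. A corpus of billions of tokens would need an index over fingerprint bit blocks or a similar bucketing scheme.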
Abstract
The Intellecta dataset is a synthetic dataset designed to enhance the cognitive processing capabilities of contemporary language models. It comprises 11.53 billion tokens, combining 8.01 billion tokens of synthetic data with 3.52 billion tokens of textbook data, and is built to foster advanced reasoning and comprehensive educational narrative generation. The synthetic portion is generated with the Mixtral-8x7B-Instruct-v0.1 model, which produces complex thought processes and detailed, textbook-style explanations so that language models trained on the data can engage in both critical thinking and in-depth educational discourse. This hybrid dataset demonstrates the potential of synthetic data to advance AI, offering a repository that is large and varied and that is also refined to align with ethical standards and intellectual rigor.
