OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs
Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, Boris Ginsburg
TL;DR
OpenCodeInstruct introduces a 5-million-sample, open-access dataset for coding instruction tuning, combining large-scale synthetic generation (via Genetic-Instruct), two seed streams (algorithmic and generic), unit-test feedback, and LLM-based quality judgments. Fine-tuning Llama 3 and Qwen2.5-Coder across multiple scales on this dataset yields substantial gains on standard benchmarks (HumanEval, MBPP, LiveCodeBench, BigCodeBench) compared to instruction-tuned baselines. The work provides extensive analyses of data scaling, seed diversity, instruction generation methods, and NL-to-Code versus Code-to-Code prompting, offering practical guidelines for future code-LLM instruction-tuning efforts. By releasing both the dataset and methodological insights, OpenCodeInstruct aims to accelerate progress in open-source code LLMs and reproducible research in the field.
Abstract
Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly available supervised fine-tuning (SFT) datasets tailored for coding tasks. To bridge this gap, we introduce OpenCodeInstruct, the largest open-access instruction-tuning dataset for code LLMs, comprising 5 million diverse samples. Each sample includes a programming question, a solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, at multiple scales (1B+, 3B+, and 7B+) on our dataset. Comprehensive evaluations on popular benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench) demonstrate substantial performance improvements from SFT with OpenCodeInstruct. We also present a detailed methodology encompassing seed data curation, synthetic instruction and solution generation, and filtering.
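To make the per-sample structure concrete, the following is a minimal sketch of how one record and a simple quality filter might be represented. The class name, field names, score scale, and thresholds are hypothetical, chosen only for illustration; they are not the released dataset's schema, and the filter is not the authors' filtering procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodeSample:
    """Illustrative record; all field names here are assumptions."""
    question: str             # programming question (the instruction)
    solution: str             # generated solution code
    test_cases: List[str]     # unit tests, stored as source strings
    test_results: List[bool]  # execution feedback: one pass/fail per test
    judge_score: float        # LLM-generated quality assessment (assumed numeric)

    def pass_rate(self) -> float:
        """Fraction of unit tests that passed; 0.0 when there are no results."""
        if not self.test_results:
            return 0.0
        return sum(self.test_results) / len(self.test_results)


def filter_samples(samples: List[CodeSample],
                   min_pass_rate: float = 1.0,
                   min_judge_score: float = 0.0) -> List[CodeSample]:
    """Keep samples meeting both thresholds (hypothetical criteria)."""
    return [
        s for s in samples
        if s.pass_rate() >= min_pass_rate and s.judge_score >= min_judge_score
    ]


samples = [
    CodeSample("Reverse a string.", "def rev(s): return s[::-1]",
               ["assert rev('ab') == 'ba'"], [True], 4.5),
    CodeSample("Add two numbers.", "def add(a, b): return a - b",
               ["assert add(1, 2) == 3", "assert add(0, 0) == 0"],
               [False, True], 2.0),
]
kept = filter_samples(samples, min_pass_rate=1.0, min_judge_score=3.0)
```

With these assumed thresholds, only the first sample is kept: the second passes one of its two tests and falls below the judge-score cutoff. Keeping execution feedback and judge assessments alongside each sample, rather than pre-filtering, lets users of such a dataset choose their own quality thresholds.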
