Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models
Somshubra Majumdar, Vahid Noroozi, Mehrzad Samadi, Sean Narenthiran, Aleksander Ficek, Wasi Uddin Ahmad, Jocelyn Huang, Jagadeesh Balam, Boris Ginsburg
TL;DR
Genetic-Instruct introduces an evolution-inspired pipeline for generating large-scale synthetic coding instruction datasets with three specialized LLMs: an Instructor-LLM for instruction creation, a Coder-LLM for code generation, and a Judge-LLM for quality assessment. The framework runs in parallel across multiple colonies, applying mutation and crossover to instructions, followed by code generation, fitness evaluation, and a decontamination step, and yields over 7.5 million samples from 512 seeds. Models fine-tuned on these synthetic instructions show significant improvements on standard code-generation benchmarks, outperforming several baseline synthetic-generation methods and public datasets. The authors also provide ablations showing that mutation and crossover offer complementary benefits, and they release the dataset to support open-source LLM development.
Abstract
Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks, where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for instruction generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. The approach is highly parallelizable and remains effective even with a small seed set and weaker generator models. We generated more than 7.5 million coding instructions with this approach and evaluated them by fine-tuning LLMs on the synthetic samples, which yielded a significant improvement in code generation capability compared with other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
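
The generation loop described above can be illustrated with a minimal Python sketch of a single colony. This is not the authors' implementation: the function names, the `crossover_prob` parameter, the boolean judge, the one-sample-at-a-time loop, and the exact-match decontamination are all assumptions made for illustration, and the three LLMs are replaced by plain callables so the example runs without any model.

```python
import random
from typing import Callable, List, Set, Tuple

Sample = Tuple[str, str]  # (instruction, code)


def evolve(
    seeds: List[str],
    mutate: Callable[[str], str],             # stand-in for Instructor-LLM mutation
    crossover: Callable[[str, str], str],     # stand-in for Instructor-LLM crossover
    coder: Callable[[str], str],              # stand-in for Coder-LLM
    judge: Callable[[str, str], bool],        # stand-in for Judge-LLM (pass/fail is an assumption)
    target_size: int,
    rng: random.Random,
    crossover_prob: float = 0.5,              # hypothetical parameter
    max_steps: int = 1000,                    # guard against endless loops
) -> List[Sample]:
    population = list(seeds)
    seen: Set[str] = set(population)
    accepted: List[Sample] = []
    for _ in range(max_steps):
        if len(accepted) >= target_size:
            break
        # Pick an evolutionary operation and create a new instruction.
        if len(population) >= 2 and rng.random() < crossover_prob:
            parent_a, parent_b = rng.sample(population, 2)
            child = crossover(parent_a, parent_b)
        else:
            child = mutate(rng.choice(population))
        if child in seen:
            continue  # skip exact duplicates
        seen.add(child)
        code = coder(child)
        # Fitness evaluation: keep only pairs the judge accepts.
        if judge(child, code):
            accepted.append((child, code))
            population.append(child)  # assumed: accepted children can become parents
    return accepted


def decontaminate(samples: List[Sample], benchmark_prompts: List[str]) -> List[Sample]:
    # Placeholder decontamination: drop instructions that match a benchmark
    # prompt after lowercasing and whitespace normalization.
    def norm(text: str) -> str:
        return " ".join(text.lower().split())

    blocked = {norm(p) for p in benchmark_prompts}
    return [s for s in samples if norm(s[0]) not in blocked]


# Toy stand-ins for the three LLMs, for demonstration only.
def toy_mutate(instruction: str) -> str:
    return instruction + " Handle edge cases."


def toy_crossover(a: str, b: str) -> str:
    return a + " Then: " + b


def toy_coder(instruction: str) -> str:
    return "# solution for: " + instruction + "\npass"


def toy_judge(instruction: str, code: str) -> bool:
    return len(instruction) < 200
```

A call such as `evolve(["Reverse a string.", "Sum a list."], toy_mutate, toy_crossover, toy_coder, toy_judge, target_size=5, rng=random.Random(0))` returns up to five accepted instruction-code pairs. In this simplified picture, running several colonies in parallel would amount to running independent copies of `evolve` on the same seeds with different random states and merging the results before decontamination.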
