Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Somshubra Majumdar; Vahid Noroozi; Mehrzad Samadi; Sean Narenthiran; Aleksander Ficek; Wasi Uddin Ahmad; Jocelyn Huang; Jagadeesh Balam; Boris Ginsburg

TL;DR

Genetic-Instruct introduces an evolution-inspired pipeline to generate large-scale synthetic coding instruction datasets using three specialized LLMs: Instructor-LLM for instruction creation, Coder-LLM for code generation, and Judge-LLM for quality assessment. The framework runs in parallel across multiple colonies, applying mutation and crossover, followed by code generation, fitness evaluation, and a decontamination step, yielding over 7.5 million samples from 512 seeds. Models trained with these synthetic instructions show significant improvements on standard code-generation benchmarks, outperforming several baseline synthetic-generation methods and public datasets. The authors also provide ablations validating the complementary benefits of mutation and crossover and release the dataset to support open-source LLM development.

Abstract

Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks, where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. The approach is highly parallelizable and remains effective even with a small seed set and weaker generator models. We generated more than 7.5 million coding instructions with it, then evaluated them by fine-tuning LLMs on the synthetic samples, which significantly improved code generation capability compared with other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
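
The loop described in the abstract can be sketched roughly as follows for a single colony. This is a minimal illustration under stated assumptions, not the authors' implementation: the three LLMs are replaced by plain callables, and the names evolve_colony, decontaminate, crossover_prob, and the toy_* stubs are hypothetical. Feeding accepted instructions back into the parent pool, the 50/50 choice between operations, and the exact-match decontamination filter are likewise simplifying assumptions made only to keep the example runnable.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (instruction, code)


def evolve_colony(
    seeds: List[str],
    instructor: Callable[[str, List[str]], str],  # stand-in for the Instructor-LLM
    coder: Callable[[str], str],                  # stand-in for the Coder-LLM
    judge: Callable[[str, str], bool],            # stand-in for the Judge-LLM
    target_size: int,
    rng: random.Random,
    crossover_prob: float = 0.5,                  # assumed, not taken from the paper
    max_attempts: int = 10_000,                   # guard against a judge that rejects everything
) -> List[Sample]:
    """Grow one colony until it holds target_size accepted samples."""
    population: List[Sample] = []
    pool = list(seeds)  # instructions that may serve as parents
    attempts = 0
    while len(population) < target_size and attempts < max_attempts:
        attempts += 1
        if len(pool) >= 2 and rng.random() < crossover_prob:
            # Crossover: derive a new instruction from two parents.
            instruction = instructor("crossover", rng.sample(pool, 2))
        else:
            # Mutation: derive a new instruction from a single parent.
            instruction = instructor("mutation", [rng.choice(pool)])
        code = coder(instruction)
        # Fitness check: keep the pair only if the judge accepts it.
        if judge(instruction, code):
            population.append((instruction, code))
            pool.append(instruction)  # assumption: accepted instructions can become parents
    return population


def decontaminate(samples: List[Sample], benchmark_prompts: List[str]) -> List[Sample]:
    """Drop samples whose instruction matches a benchmark prompt.

    Exact match after case and whitespace normalization is a simple stand-in;
    the paper's decontamination step is not reproduced here.
    """
    banned = {p.strip().lower() for p in benchmark_prompts}
    return [s for s in samples if s[0].strip().lower() not in banned]


# Toy stand-ins so the sketch runs without any model.
def toy_instructor(operation: str, parents: List[str]) -> str:
    return f"{operation}({' + '.join(parents)})"


def toy_coder(instruction: str) -> str:
    return f"# solution for: {instruction}\npass"


def toy_judge(instruction: str, code: str) -> bool:
    return bool(code.strip())
```

Because each colony depends only on its own seeds and random state, several calls to evolve_colony with different random.Random instances could run in separate processes and have their outputs merged before decontamination, which is one way to read the parallel-colony design.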

Paper Structure (21 sections, 9 figures, 3 tables)

Introduction
Previous Works
Genetic-Instruct
Mutation Operation
Crossover Operation
Code Generation
Fitness Function
Scaling Up the Process
LLM Decontamination
Experiments
Experimental Settings
Performance Evaluation
Ablation Study
Influence of the Generator Model
Conclusion
...and 6 more sections

Figures (9)

Figure 1: The overall process of Genetic-Instruct across multiple parallel colonies per generation. Each colony begins with a small seed population, from which an Instructor-LLM applies crossover and mutation to create new instructions. A Coder-LLM then generates corresponding code solutions, which are evaluated by a Judge-LLM for correctness and quality. Once the target population size is reached, samples are decontaminated to form the final population.
Figure 2: The accuracy of Llama-3.1-8B trained on different data sizes. Code accuracy is the average of the model's accuracy across all four benchmarks. As the synthetic data is scaled up, accuracy improves, but with diminishing returns at larger sizes.
Figure 3: Prompt template for the mutation operation
Figure 4: Prompt template for the crossover operation with few-shot in-context learning
Figure 5: Prompt template for code generation with the Coder-LLM
...and 4 more figures
