Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Somshubra Majumdar, Vahid Noroozi, Mehrzad Samadi, Sean Narenthiran, Aleksander Ficek, Wasi Uddin Ahmad, Jocelyn Huang, Jagadeesh Balam, Boris Ginsburg

TL;DR

Genetic-Instruct introduces an evolutionary-inspired pipeline to generate large-scale synthetic coding instruction datasets using three specialized LLMs: Instructor-LLM for instruction creation, Coder-LLM for code generation, and Judge-LLM for quality assessment. The framework runs in parallel across multiple colonies, employing mutation and crossover, followed by code generation, fitness evaluation, and a decontamination step, yielding over 7.5 million samples from 512 seeds. Models trained with these synthetic instructions show significant improvements on standard code-generation benchmarks, outperforming several baseline synthetic-generation methods and public datasets. The authors also provide ablations validating the complementary benefits of mutation and crossover and release the dataset to support open-source LLM development.

Abstract

Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. Our proposed approach is highly parallelizable and effective even with a small seed dataset and weaker generator models. We generated more than 7.5 million coding instructions with the proposed approach. We then evaluated it by fine-tuning LLMs on the synthetic samples and demonstrated a significant improvement in their code generation capability compared to other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
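The pipeline summarized above (mutation or crossover by an Instructor-LLM, code synthesis by a Coder-LLM, filtering by a Judge-LLM, parallel colonies, then decontamination) can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function name, the `mutation_prob` parameter, the single-generation loop, and the exact-match decontamination rule are assumptions made for illustration, and the three LLMs are replaced by trivial string stubs so the control flow can run without any model.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Sample = Tuple[str, str]  # (instruction, code)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical prompts compare equal."""
    return " ".join(text.lower().split())


def instructor_mutate(instruction: str, rng: random.Random) -> str:
    """Hypothetical stand-in for the Instructor-LLM mutation prompt."""
    twist = rng.choice(["using recursion", "for very large inputs", "without extra memory"])
    return f"{instruction} {twist}"


def instructor_crossover(pool: List[str], rng: random.Random) -> str:
    """Hypothetical stand-in for the Instructor-LLM crossover prompt (needs >= 2 parents)."""
    first, second = rng.sample(pool, 2)
    return f"{first}, then {second}"


def coder(instruction: str) -> str:
    """Hypothetical stand-in for the Coder-LLM: returns a placeholder solution."""
    return f"# solution for: {instruction}\npass"


def judge(instruction: str, code: str) -> bool:
    """Hypothetical stand-in for the Judge-LLM: a trivial consistency check."""
    return instruction in code


def run_colony(seeds: List[str], target_size: int, rng_seed: int,
               mutation_prob: float = 0.5) -> List[Sample]:
    """Grow one colony from the seeds until target_size samples pass the judge."""
    rng = random.Random(rng_seed)
    pool = list(seeds)            # instructions available as parents
    accepted: List[Sample] = []
    while len(accepted) < target_size:
        if rng.random() < mutation_prob:
            instruction = instructor_mutate(rng.choice(pool), rng)
        else:
            instruction = instructor_crossover(pool, rng)
        code = coder(instruction)
        if judge(instruction, code):          # fitness evaluation
            accepted.append((instruction, code))
            pool.append(instruction)          # accepted samples can become parents
    return accepted


def decontaminate(samples: List[Sample], benchmark_prompts: List[str]) -> List[Sample]:
    """Drop duplicates and samples whose instruction matches a benchmark prompt.

    Exact match after normalization is an assumed rule, chosen only for brevity.
    """
    blocked = {normalize(p) for p in benchmark_prompts}
    seen = set()
    kept: List[Sample] = []
    for instruction, code in samples:
        key = normalize(instruction)
        if key in blocked or key in seen:
            continue
        seen.add(key)
        kept.append((instruction, code))
    return kept


def genetic_instruct(seeds: List[str], num_colonies: int, colony_size: int,
                     benchmark_prompts: List[str]) -> List[Sample]:
    """Run independent colonies in parallel, merge them, then decontaminate."""
    with ThreadPoolExecutor(max_workers=num_colonies) as executor:
        colonies = list(executor.map(
            lambda i: run_colony(seeds, colony_size, rng_seed=i),
            range(num_colonies)))
    merged = [sample for colony in colonies for sample in colony]
    return decontaminate(merged, benchmark_prompts)


if __name__ == "__main__":
    seeds = ["Sort a list of integers", "Reverse a string", "Count words in a file"]
    data = genetic_instruct(seeds, num_colonies=4, colony_size=5,
                            benchmark_prompts=["Reverse a string"])
    print(len(data), "samples kept")
```

Because each colony only reads the shared seeds and owns its random state, colonies need no coordination until the final merge, which is what makes this style of generation straightforward to parallelize. In a real setting the stubs would be replaced by calls to the respective models.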

Paper Structure (21 sections, 9 figures, 3 tables)


Figures (9)

  • Figure 1: The overall process of Genetic-Instruct across multiple parallel colonies per generation. Each colony begins with a small seed population, from which an Instructor-LLM applies crossover and mutation to create new instructions. A Coder-LLM then generates corresponding code solutions, which are evaluated by a Judge-LLM for correctness and quality. Once the target population size is reached, samples are decontaminated to form the final population.
  • Figure 2: The accuracy of Llama-3.1-8B trained on different data sizes. Code accuracy is calculated as the average of the model's accuracy on all four benchmarks. As the synthetic data is scaled up, accuracy improves, but the gains eventually diminish.
  • Figure 3: Prompt template for the mutation operation
  • Figure 4: Prompt template for the crossover operation with few-shot in-context learning
  • Figure 5: Prompt template for code generation with the Coder-LLM
  • ...and 4 more figures