UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

TL;DR

UnitCoder tackles the code data quality bottleneck by mining pre-training code and validating synthesis with model-generated unit tests. Its three-stage pipeline (Data Preparation, Fix and Refine Flow, and Post-Train) produces a dataset of over 500K verifiable Python programs across hundreds of packages, enabling effective post-training of open-source base models. Empirical results on BigCodeBench, HumanEval, and MBPP show consistent gains, particularly on API-heavy tasks, with substantial improvements for Llama3.1-8B and InternLM2.5-7B on complex package usage. The work demonstrates a scalable, verification-driven approach to data synthesis that uses unit tests for both guidance and validation, highlighting the importance of test-driven data curation in code generation.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale pre-training data and (ii) synthesizing instruction data through prompt engineering with powerful models. While pre-training data suffers from inconsistent quality, instruction-based synthesis suffers from limited instruction diversity and the inherent biases of LLMs. To address this gap, we introduce UnitCoder, a systematic pipeline that leverages model-generated unit tests to both guide and validate the code generation process. Combined with large-scale package-based retrieval from pre-training corpora, we generate a dataset of 500K+ verifiable programs containing diverse API calls. Evaluations on multiple Python benchmarks (BigCodeBench, HumanEval, MBPP) demonstrate that models fine-tuned on our synthetic data exhibit consistent performance improvements. Notably, Llama3.1-8B and InternLM2.5-7B improve from 31% and 28% to 40% and 39% success rates on BigCodeBench, respectively. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released (https://github.com).
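
The abstract mentions package-based retrieval from a pre-training corpus but does not say how package usage is detected. As a rough illustration only, the sketch below assumes that a file counts as "using" a package when it imports it, and it finds imports by parsing the source with Python's standard `ast` module. The function names `imported_packages` and `filter_by_packages` are hypothetical and are not taken from the paper.

```python
import ast
from collections import Counter


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a Python source string."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # Unparseable files are treated as importing nothing (an assumption).
        return set()
    packages: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x".
            packages.add(node.module.split(".")[0])
    return packages


def filter_by_packages(sources, target_packages):
    """Keep sources importing any target package; also count imports per package."""
    kept = []
    usage = Counter()
    for source in sources:
        hits = imported_packages(source) & target_packages
        if hits:
            kept.append(source)
            usage.update(hits)
    return kept, usage
```

A call such as `filter_by_packages(corpus, {"numpy", "pandas"})` would return the matching files together with a per-package import count, which is the kind of usage statistic a package-frequency analysis would need.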

Paper Structure

This paper contains 29 sections, 1 equation, 3 figures, 12 tables, 1 algorithm.

Figures (3)

  • Figure 1: The UnitCoder pipeline. The pipeline consists of three main stages: (1) Data Preparation: filter package-centric data from the raw code corpus and fine-tune a unit test generator to produce corresponding tests; (2) Fix and Refine Flow: execute function-test pairs in a sandbox, iteratively fix failed cases via a bug-fix agent, and refine successful code through a refine agent; (3) Post-Train: construct prefix-completion pairs for post-training.
  • Figure 2: Scaling Effects of Synthetic Data: As the scale of synthetic data (measured in tokens) increases, we observe a corresponding growth in both the diversity of unique packages in synthetic data and InternLM2.5-7B's performance on BigCodeBench after post-training.
  • Figure 3: The distribution of packages in filtered code data, grouped by usage frequency. Usage represents the frequency of package imports, and Percentage shows the percentage of package types within each frequency group relative to the total number.
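
The execute-and-fix part of stage (2) in Figure 1 can be sketched as a small loop. This is a minimal sketch under several assumptions, not the authors' implementation: the "sandbox" is approximated by a separate Python process with a timeout, the bug-fix agent is a hypothetical callable `fix_agent(code, tests, error_log)` that returns revised code, and the round limit `max_rounds` is an illustrative parameter. The refine agent applied to passing code is omitted.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical agent interface: (code, tests, error_log) -> revised code.
FixAgent = Callable[[str, str, str], str]


def run_tests(code: str, tests: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run code followed by its tests in a separate interpreter; return (passed, log)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    # A non-zero exit code means a test assertion or the code itself failed.
    return proc.returncode == 0, proc.stderr


def fix_loop(
    code: str, tests: str, fix_agent: FixAgent, max_rounds: int = 3
) -> Optional[str]:
    """Return code that passes the tests, or None if it still fails after max_rounds fixes."""
    for round_idx in range(max_rounds + 1):
        passed, log = run_tests(code, tests)
        if passed:
            return code
        if round_idx < max_rounds:
            # Hand the failing code and the error log to the (hypothetical) agent.
            code = fix_agent(code, tests, log)
    return None
```

Samples for which `fix_loop` returns `None` would simply be discarded in this sketch, so only code verified by its tests moves on to the later stages. A separate process with a timeout is not a real security boundary; a production pipeline would need stronger isolation.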