
EpiCoder: Encompassing Diversity and Complexity in Code Generation

Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, Scarlett Li

TL;DR

EpiCoder addresses the limited diversity and complexity of seed data for code generation with a feature tree-based synthesis framework that organizes high-level code abstractions into a hierarchy. The method evolves this tree to broaden coverage, reweights and samples features to generate instruction data of varying difficulty, and applies iterative refinement to improve correctness, enabling function-level, file-level, and potentially repository-level code synthesis. The EpiCoder variants reach state-of-the-art performance on multiple benchmarks, and the synthesized data is shown to be more complex and diverse than existing code instruction data, including cross-file dependency handling and large-scale repository-like generation. The work offers a practical way to produce high-quality, diverse, and scalable training data for code LLMs, with implications for real-world software development workflows and for future research on repository-level code synthesis.

Abstract

Existing methods for code generation use code snippets as seed data, restricting the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, allowing it to capture more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely used base models to obtain the EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential for synthesizing repository-level code data. Our code and data are publicly available at https://github.com/microsoft/EpiCoder.
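To make the idea of controlling complexity through subtree depth and breadth concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a feature tree whose nodes carry a sampling weight, and it draws up to `max_breadth` children per node, down to `max_depth` levels below the root. The names `FeatureNode`, `sample_subtree`, `build_demo_tree`, and all feature labels are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class FeatureNode:
    """One node of a (hypothetical) feature tree."""
    name: str
    weight: float = 1.0  # assumed positive; higher means sampled more often
    children: list[FeatureNode] = field(default_factory=list)


def sample_subtree(node: FeatureNode, max_depth: int, max_breadth: int,
                   rng: random.Random) -> dict:
    """Return a nested dict {name: {child_name: {...}}} for a sampled subtree.

    max_depth limits how many levels below `node` are kept; max_breadth
    limits how many children are kept per node. Sibling names are assumed
    to be unique, otherwise later entries overwrite earlier ones.
    """
    if max_depth <= 0 or not node.children:
        return {node.name: {}}
    pool = list(node.children)
    picked = []
    # Weighted sampling without replacement: draw one index, remove it, repeat.
    while pool and len(picked) < max_breadth:
        weights = [child.weight for child in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picked.append(pool.pop(index))
    subtree = {}
    for child in picked:
        subtree.update(sample_subtree(child, max_depth - 1, max_breadth, rng))
    return {node.name: subtree}


def count_nodes(tree: dict) -> int:
    """Number of features in a sampled subtree."""
    return sum(1 + count_nodes(sub) for sub in tree.values())


def tree_depth(tree: dict) -> int:
    """Number of levels in a sampled subtree (a lone root has depth 1)."""
    return max((1 + tree_depth(sub) for sub in tree.values()), default=0)


def build_demo_tree() -> FeatureNode:
    """A tiny illustrative tree; the labels are made up for this example."""
    return FeatureNode("root", children=[
        FeatureNode("data processing", 2.0, [
            FeatureNode("data cleaning"),
            FeatureNode("data transformation"),
            FeatureNode("data validation"),
        ]),
        FeatureNode("file operations", 1.0, [
            FeatureNode("read csv"),
            FeatureNode("write json"),
        ]),
        FeatureNode("algorithm", 1.0, [
            FeatureNode("sorting"),
            FeatureNode("graph search"),
        ]),
    ])


if __name__ == "__main__":
    rng = random.Random(0)
    root = build_demo_tree()
    simple = sample_subtree(root, max_depth=1, max_breadth=1, rng=rng)
    harder = sample_subtree(root, max_depth=2, max_breadth=3, rng=rng)
    print(count_nodes(simple), simple)
    print(count_nodes(harder), harder)
```

In this sketch, a shallow and narrow subtree yields a small feature set that might seed a function-level task, while a deeper and wider one yields more features that would have to be combined, which is one plausible way to map subtree size onto task difficulty.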

Paper Structure

This paper contains 53 sections, 1 equation, 12 figures, 15 tables, and 2 algorithms.

Figures (12)

  • Figure 1: Benchmark performance of EpiCoder-Qwen-7B (fine-tuned on Qwen2.5-Coder-7B-Base) and its counterparts. XFileDep is a file-level code generation benchmark; all others are function-level.
  • Figure 2: Overview of our feature tree-based code generation framework, which consists of three steps: (a) Feature Tree Extraction, where we first extract the feature set to construct the tree structure demonstration and then extract the feature trees; (b) Feature Tree Evolution, where the feature tree is iteratively expanded in depth and breadth; and (c) Feature Tree-Based Code Generation, where the evolved feature tree is used to generate diverse code instruction data. A detailed example of feature evolution and code generation is shown in the paper's appendix.
  • Figure 3: Model performance across domains of Python in the English Subset of FullStackBench.
  • Figure 4: Pass@1 (%) results of different LLMs on XFileDep computed with greedy decoding.
  • Figure 5: An example of our repo-level code generation. The left part shows the original LLaMA-Factory repository structure, the middle part presents the structure of LLMTune, which we generated based on the extracted feature tree, and the right part illustrates an example file from the generated repository.
  • ...and 7 more figures